The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. The character images were based on 20 different fonts, and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts), which were then scaled to fit into a range of integer values from 0 through 15. The dataset's original description suggests training on the first 16,000 items and predicting the letter category for the remaining 4,000; the code below instead uses a random, stratified 70/30 holdout split.
https://www.kaggle.com/datasets/nishan192/letterrecognition-using-svm?resource=download
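Before modeling, it can help to confirm that the downloaded file matches the description above. A minimal sanity check (assuming the CSV's label column is named letter, as it is in the code below) might look like:

```matlab
% Quick inspection of the downloaded dataset.
data = readtable("letter-recognition.csv");
size(data)           % should report 20000 rows and 17 columns (label + 16 attributes)
unique(data.letter)  % should list the 26 capital letters A through Z
```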
data = readtable("letter-recognition.csv");
rng("default");
cv = cvpartition(data.letter,HoldOut=0.30);
trainidx = training(cv);
testidx = test(cv);
traindata = data(trainidx,:);
testdata = data(testidx,:);
knnmodel = fitcknn(traindata,"letter",NumNeighbors=5,Distance="euclidean");
predictedletters = predict(knnmodel,testdata);
testloss = loss(knnmodel,testdata)
resubloss = resubLoss(knnmodel)
confusionchart(testdata.letter,predictedletters,RowSummary="row-normalized")
data = readtable("letter-recognition.csv");
This code imports the data from the spreadsheet letter-recognition.csv and stores it in a table variable called data.
rng("default");
This code uses the rng function to set the random number generator to its default behavior, which is useful for reproducibility of results.
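A quick way to see what this buys you: after resetting the generator, the same sequence of random draws repeats exactly, so the holdout partition below comes out the same on every run.

```matlab
% Resetting the generator makes random draws repeat exactly.
rng("default"); a = rand(1,3);
rng("default"); b = rand(1,3);
isequal(a,b)   % returns logical 1 (true)
```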
cv = cvpartition(data.letter,HoldOut=0.30);
trainidx = training(cv);
testidx = test(cv);
traindata = data(trainidx,:);
testdata = data(testidx,:);
This code creates a partitioned dataset using the cvpartition function, with 30% of the data held out for testing and the remaining 70% for training. Because the first argument is data.letter, the partition is stratified: each letter appears in roughly the same proportion in both sets.
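As a sketch, you can verify the split came out as intended — roughly 14,000 training rows and 6,000 test rows, with similar per-letter proportions in each set:

```matlab
% Check the holdout split sizes and the (stratified) class balance.
[sum(trainidx) sum(testidx)]   % approx. [14000 6000]
tabulate(traindata.letter)     % per-letter counts and percentages
```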
knnmodel = fitcknn(traindata,"letter",NumNeighbors=5,Distance="euclidean");
This code uses the fitcknn function to train a K-Nearest Neighbors (KNN) classification model, specifying 5 neighbors and the Euclidean distance metric, using the training data and "letter" as the response variable.
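The choice of 5 neighbors is just one option. A simple (hypothetical) sweep over several values of k, reusing the same partition, could look like this:

```matlab
% Sketch: compare test loss across several neighbor counts.
ks = [1 3 5 7 9 11];
losses = zeros(size(ks));
for i = 1:numel(ks)
    mdl = fitcknn(traindata,"letter",NumNeighbors=ks(i),Distance="euclidean");
    losses(i) = loss(mdl,testdata);
end
[ks; losses]   % inspect which k gives the lowest test loss
```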
predictedletters = predict(knnmodel,testdata);
This code uses the trained KNN model (knnmodel) to make predictions on the test data and stores the predicted letters in the variable predictedletters.
testloss = loss(knnmodel,testdata)
This code computes the test loss of the model on the held-out data. By default this is the classification error: the fraction of test observations misclassified, weighted by the class prior probabilities.
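With the default settings this is essentially the misclassification rate, so computing it directly from the predictions should give a very similar number:

```matlab
% Cross-check: fraction of test rows whose predicted letter differs
% from the true letter.
errrate = mean(~strcmp(predictedletters, testdata.letter))
```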
resubloss = resubLoss(knnmodel)
This code calculates the resubstitution loss of the model: the loss obtained when the training data is "resubstituted" into the model and re-predicted. It is an optimistic estimate of performance, since the model has already seen these observations.
confusionchart(testdata.letter,predictedletters,RowSummary="row-normalized")
This code creates a confusion chart to visualize the comparison between the actual and predicted letters, providing a row-normalized summary of the results.
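The counts behind the chart are also available numerically via confusionmat, which makes it easy to ask, for example, which letter the model finds hardest. A minimal sketch:

```matlab
% Per-letter recall from the raw confusion matrix.
[cm, order] = confusionmat(testdata.letter, predictedletters);
perclassacc = diag(cm) ./ sum(cm,2);   % correct predictions per true letter
[~, worst] = min(perclassacc);
order(worst)                           % the letter with the lowest recall
```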
testloss = 0.0510
resubloss = 0.0286
The test loss (about 5.1% misclassified) is higher than the resubstitution loss (about 2.9%), which is expected: a model always looks at least somewhat better on the data it was trained on.