💻 K-Nearest Neighbours

January 31, 2022

GitHub Link for Code Visuals
Soowan Choi

(Classifying Iris Species)

Classification - Supervised - Instance Based - Non Parametric

Tuning K Hyperparameter with Cross-Validation

1) Problem

a. Classification

flower: iris-setosa, iris-versicolor, iris-virginica

b. Explore

150 samples (rows), 4 features (columns), 1 target (column)
balanced class: 50 iris-setosa, 50 iris-versicolor, 50 iris-virginica
no missing values

Reference: https://www.kaggle.com/datasets/uciml/iris

1) a. Load the data…

#load the dataset
import pandas as pd     #for data organization
url = 'https://raw.githubusercontent.com/swanscodex/swanscodex/main/Iris.csv'   #github url to csv file
df = pd.read_csv(url)   #stored as dataframe

1) b. Explore the problem…

#how many samples and features?
print(f'there are {df.shape[0]} rows and {df.shape[1]} columns in this dataset \n')
df.head()               #prints the first 5 rows of data

#how many samples of each class?
print(f'there are {len(df.Species.unique())} types of flowers to classify in this dataset: \n')

#print the names of each class and sample length
for i, name in enumerate(df.Species.unique(), start=1):
  print(f'{i} = {name}, sample data = {(df.Species == name).sum()}')

#data statistics
df.describe() 

#how many missing values in dataset?
df.info()   #no missing values in each column

2) Data

a. Clean
b. Xy Split
c. Test/Train Split
d. Standardize

2) a. Clean the data…

#convert categorical data into numerical data
from sklearn.preprocessing import LabelEncoder

mappings = list()          #create an empty list to create dictionary mapping categorical to numerical values
encoder = LabelEncoder()   #create an instance of label encoder

df['Species'] = encoder.fit_transform(df['Species'])                           #encode to numerical values
mappings_dict = {index: label for index, label in enumerate(encoder.classes_)} #create a dictionary for mapping categorical to numerical values
mappings.append(mappings_dict)                                                 #store dictionary for each column into single list

df.head()                  #show the encoded dataframe 

print(df.columns[-1],'=',mappings[0])       #print the column names and the associated dictionary mapping
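As a small illustration of the encoding step, the sketch below uses a made-up list of species names (not the actual dataframe) to show that `LabelEncoder` assigns codes in sorted order of the class names, and that `inverse_transform` maps codes back to names.

```python
from sklearn.preprocessing import LabelEncoder

# made-up sample of labels, deliberately not in alphabetical order
species = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

enc = LabelEncoder()
codes = enc.fit_transform(species)          # classes are sorted, then numbered 0, 1, 2
names = enc.inverse_transform([0, 1, 2])    # map numeric codes back to the species names

print(codes)
print(list(names))
```

The same `inverse_transform` call can be applied to model predictions to report species names instead of numbers.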

2) b. Split the data into X features and y labels…

#Xy split
feature_data = df.iloc[:, 1:-1]     #split dataframe to get feature data X
target_data = df.iloc[:,-1]         #split dataframe to get target data y

feature_data.head(3)                #show the split dataframe of feature data X

2) c. Split the data into training and testing sets…

#test/train split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(feature_data, target_data, test_size = 0.2, random_state=1)  #train set = 80%, test set = 20%

len(X_test) / len(df)    #test set is split such that it is 20% of the entire dataset

2) d. Standardize the data…

#standardize 
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)  #fit the scaler on TRAINING dataset only 
#(do not use test set as it leaks info on test data distribution and could lead to overestimating the model performance)

X_train_sc = scaler.transform(X_train)  #transform train/test set with the scaler
X_test_sc = scaler.transform(X_test)

X_train_sc = pd.DataFrame(X_train_sc, columns = X_train.columns, index = X_train.index) #convert scaled train/test sets to dataframe
X_test_sc = pd.DataFrame(X_test_sc, columns = X_test.columns, index = X_test.index)

X_train_sc = X_train_sc.dropna() #drop null values
X_test_sc = X_test_sc.dropna()

#X_train_sc.std(axis='index', ddof=0)                        #to verify standard deviation for each column is 1 (ddof=0 matches StandardScaler; the pandas default is ddof=1)
X_train_sc.mean(axis='index')                                #to verify mean for each column is 0 
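To make the standardization step concrete, here is a minimal sketch of the same calculation done by hand with NumPy. The numbers are made up and stand in for a single feature column; the point is that the mean and standard deviation come from the training values only and are then reused on the test values.

```python
import numpy as np

# made-up values standing in for one feature column
train = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
test = np.array([5.0, 9.0])

mu = train.mean()       # mean estimated from the training data only
sigma = train.std()     # population standard deviation (ddof=0), the convention StandardScaler uses

train_z = (train - mu) / sigma    # z-score: (x - mean) / std
test_z = (test - mu) / sigma      # test data reuses the TRAINING mean and std

print(train_z.mean(), train_z.std())
print(test_z)
```

After the transform the training column has mean 0 and standard deviation 1, while the test column generally does not, which is expected.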

Visualize the data…

#visualize data
import matplotlib.pyplot as plt

#combine features and targets of training data into single dataframe for visualization
df_visual = X_train_sc.join(y_train)

setosa = df_visual[df_visual.Species == 0]    #only setosa data
versi = df_visual[df_visual.Species == 1]     #only versicolor data
virgi = df_visual[df_visual.Species == 2]     #only virginica data

plt.scatter(setosa['SepalLengthCm'], setosa['SepalWidthCm'], label="Setosa")            #Compare same two features for each class
plt.scatter(versi['SepalLengthCm'], versi['SepalWidthCm'], label="Versicolor")
plt.scatter(virgi['SepalLengthCm'], virgi['SepalWidthCm'], label="Virginica")                 
plt.legend()
plt.xlabel("Sepal Length (cm)")
plt.ylabel("Sepal Width (cm)")
plt.title("Setosa-Versicolor-Virginica Data Visualization");

3) Model: KNN Classifier

a. Cross Validate (Fit, Train, Predict -> Parameter)
b. Evaluate using Test Data
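As a rough illustration of what a KNN classifier does when it predicts, here is a minimal from-scratch sketch, assuming Euclidean distance and a simple majority vote. The function name `knn_predict` and the toy points are hypothetical and only for illustration; the rest of this post uses scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())            # count the labels among those neighbours
    return votes.most_common(1)[0][0]                     # the most frequent label wins

# made-up toy data: two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3)
print(pred_a, pred_b)
```

There is no training step beyond storing the data, which is why KNN is described as instance based and non parametric.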

3) a. Cross-validation to select best K for generalizing on unseen data…

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate            #import the cross validation score function from the sklearn library
import matplotlib.pyplot as plt

k = range(1,101)                                              #sweep through k(number of neighbours) from 1 to 100
scores_train = []                                             #create an empty list to store the training accuracy score from cross validation
scores_val = []                                               #create an empty list to store the validation accuracy score from cross validation  
for i in k:
  knn = KNeighborsClassifier(n_neighbors=i)                   #parameter - tune the hyperparameter k in knn
  scores = cross_validate(knn, X_train_sc, y_train, cv=5, return_train_score = True) #5-fold TRAIN/VALIDATION split of the standardized training data (all 4 features)
  scores_train.append(scores['train_score'].mean())           #store the mean training accuracy from cross-validation into the list
  scores_val.append(scores['test_score'].mean())              #store the mean validation accuracy from cross-validation into the list

plt.plot(k,scores_val)                                        #plot only the validation accuracy with the changing k neighbour parameter             

plt.xlabel("k-Neighbours")                                    #label the graph
plt.ylabel("Cross-Val Validation Accuracy (Mean)")
plt.title("Cross-Val Validation Accuracy (Mean) vs k-Neighbours")

print("Max cross val validation accuracy is {}% at k parameter of {}".format(round(max(scores_val)*100,2),scores_val.index(max(scores_val))+1))

#plot training score accuracy and validation score accuracy to find optimal range of k values:
plt.plot(k[0:50],scores_train[0:50], label = 'Training Accuracy')      #plot the first 50 training score accuracy with model complexity
plt.plot(k[0:50],scores_val[0:50], label = 'Validation Accuracy')      #plot the first 50 validation score accuracy with model complexity

plt.legend()                                                           #add legend and label the graph
plt.xlabel("Model Complexity (k-Neighbours)")
plt.ylabel("Cross-Validation Accuracy (Mean)")
plt.title("Cross-Validation Accuracy (Mean) vs k-Neighbours")
plt.show()

From the graph above, k values between 1 and 15 seem to overfit the data (high training accuracy but lower validation accuracy), whereas k values greater than 18 seem to underfit the data, as both training and validation accuracy decrease (below ~95%).

for k_val, val, train in zip(k, scores_val, scores_train):
  if val > 0.93 and val > train:   #validation accuracy at or above training accuracy suggests the model is not overfitting
    print("at k = {}, cross val validation accuracy = {}% and training accuracy = {}%".format(k_val, round(val*100,2), round(train*100,2)))

The best k-Neighbour parameter from cross validation seems to be 15, as it produced the highest validation accuracy of 95.0% without overfitting/underfitting the data.
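An alternative way to run the same search, sketched here under a few assumptions, is to put the scaler and the classifier in a `Pipeline` and let `GridSearchCV` sweep k. This way the scaler is refit inside every fold, so the validation fold never influences the scaling. The sketch loads scikit-learn's bundled copy of Iris via `load_iris` as a stand-in for the CSV used in this post and sweeps k from 1 to 30 only; its results are not guaranteed to match the numbers reported here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# bundled Iris data used as a stand-in for the CSV in this post (assumption)
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

pipe = Pipeline([
    ("scale", StandardScaler()),        # refit on the training part of each fold
    ("knn", KNeighborsClassifier()),
])

# sweep k from 1 to 30 with 5-fold cross-validation
search = GridSearchCV(pipe, {"knn__n_neighbors": list(range(1, 31))}, cv=5)
search.fit(X_tr, y_tr)

best_k = search.best_params_["knn__n_neighbors"]
print("best k:", best_k, "mean validation accuracy:", round(search.best_score_, 4))
```

`search.best_estimator_` is then already refit on the full training split and can be scored on the held out test set.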

Bias-Variance tradeoff:

A low k-Neighbour value would result in high variance but low bias, as the model complexity increases with lower k-Neighbours.
A high k-Neighbour value would result in low variance but high bias, as the model complexity decreases with higher k-Neighbours.

This is a tradeoff, as high variance leads to overfitting (training accuracy >> validation accuracy), whereas high bias leads to underfitting (low training and validation accuracy).
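The effect of k on model complexity can be seen in a small sketch that compares training accuracy at a very low, a moderate, and a very high k. It uses scikit-learn's bundled Iris data (`load_iris`) as a stand-in for the CSV and scores the model on the same data it was fit on, so the numbers only illustrate how closely the model follows its training data, not how well it generalizes.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# bundled Iris data used as a stand-in for the CSV in this post (assumption)
X, y = load_iris(return_X_y=True)
X_sc = StandardScaler().fit_transform(X)

train_acc = {}
for k in (1, 15, 100):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_sc, y)
    train_acc[k] = model.score(X_sc, y)    # accuracy on the same data the model stored
    print(f"k = {k:>3}: training accuracy = {train_acc[k]:.3f}")
```

With k = 1 each training point is its own nearest neighbour, so the model reproduces the training labels almost perfectly (low bias, high variance); with k = 100 every prediction averages over most of the dataset (high bias, low variance).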

3) b. Evaluate the model on unseen test data…

#evaluate model:
#highest validation accuracy of 95.0% with k=15 and # of features = 4 using standardized data on KNN model
import numpy as np
knn = KNeighborsClassifier(n_neighbors=15)             #create the model with the k chosen from cross-validation
knn.fit(X_train_sc,y_train)                            #train the model using training data
y_predicted = knn.predict(X_test_sc)                   #predict on the held out test set

accuracy = np.sum(y_test == y_predicted) / y_test.size #fraction of correct predictions
print ("Accuracy: ", round(accuracy * 100,2), "%")

#using scikit-learn's built-in function to test accuracy:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, knn.predict(X_test_sc))
accuracy = round(accuracy*100,2)
print ("Accuracy: ", accuracy, "%")

#confusion matrix to visualize correct predictions
from sklearn.metrics import confusion_matrix

y_predicted = knn.predict(X_test_sc)
con_mat = confusion_matrix(y_test,y_predicted)

%matplotlib inline
import seaborn as sea
plt.figure(figsize=(5,3))
sea.heatmap(con_mat,annot=True)
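plt.ylabel('ACTUAL \n (Virginica | Versicolor | Setosa)')      #rows of confusion_matrix(y_true, y_pred) are the actual classes
plt.xlabel('PREDICTED \n (Setosa | Versicolor | Virginica)')   #columns are the predicted classes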
plt.show()
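When reading the heatmap, it helps to remember that scikit-learn's `confusion_matrix(y_true, y_pred)` puts the actual classes on the rows and the predicted classes on the columns. A minimal sketch with made-up labels (not the model's real predictions) to check the orientation:

```python
from sklearn.metrics import confusion_matrix

# made-up labels: 0 = setosa, 1 = versicolor, 2 = virginica
y_actual = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]    # one versicolor (1) is misclassified as virginica (2)

cm = confusion_matrix(y_actual, y_pred)
print(cm)
# the single error appears at row 1 (actual versicolor), column 2 (predicted virginica)
```

Correct predictions sit on the diagonal, so any off-diagonal count identifies which actual class was confused with which predicted class.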

The best k-Neighbours for this Iris Species dataset was found to be 15, which produced the highest validation score of 95.0%.

Using k-Neighbours of 15 with 4 features on the unseen test dataset, the testing score results in 96.67%.

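However, note that because the sample size is small, the k selected by cross-validation might not be the one that scores best on the test set:
- e.g., k=3 seems to produce testing score of 100%, even though it produced a lower validation score of 91.67%
- e.g., k=15: testing score = 96.67% (validation score = 95%)

With only 30 test samples, each misclassified flower changes the testing score by 3.33 percentage points, so the gap between 96.67% and 100% is a single flower.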

Note: Accuracy % can be used to evaluate this model (instead of F1 Score or AUC) as this is a balanced dataset (50 samples for each flower)
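To see why class balance matters for this choice, here is a small sketch with made-up, deliberately imbalanced labels. A "model" that always predicts the majority class still reaches high accuracy, while the macro-averaged F1 score exposes that it never finds the minority class.

```python
from sklearn.metrics import accuracy_score, f1_score

# made-up imbalanced labels: nine samples of class 0, one sample of class 1
y_true = [0] * 9 + [1]
y_lazy = [0] * 10              # always predicts the majority class

acc = accuracy_score(y_true, y_lazy)
f1 = f1_score(y_true, y_lazy, average="macro", zero_division=0)   # class 1 has no predictions, so its F1 is set to 0

print(acc, round(f1, 3))
```

On a balanced dataset such as Iris this failure mode does not arise, which is why plain accuracy is a reasonable metric here.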

To predict on out-of-sample data, we refit the model on the entire dataset (not just the X_train, y_train split), since more training data generally gives a more accurate model.

We set the k-Neighbours to 15…

#to predict using individual flower features (out-of-sample data)
knn = KNeighborsClassifier(n_neighbors = 15)
knn.fit(feature_data,target_data)
knn.predict([[6.7,3,4,1]])
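The cell above fits the final model on the raw (unstandardized) feature values. If the standardized features used during evaluation are preferred for the final model as well, one possible approach is to wrap the scaler and the classifier in a pipeline so that new measurements are scaled automatically before prediction. This is only a sketch: it uses scikit-learn's bundled Iris data (`load_iris`) as a stand-in for the CSV, and `final_model` and `new_flower` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# bundled Iris data used as a stand-in for the CSV in this post (assumption)
iris = load_iris()
X, y = iris.data, iris.target

# scaler + classifier in one object, fit on the entire dataset
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
final_model.fit(X, y)

# sepal length, sepal width, petal length, petal width (cm)
new_flower = np.array([[6.7, 3.0, 4.0, 1.0]])
pred = final_model.predict(new_flower)    # the pipeline scales the input, then votes among 15 neighbours
print(iris.target_names[pred[0]])
```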

-Soowan-