K-Nearest Neighbours Classifier
KNN is a very simple machine learning algorithm. Given a distance metric and a number of neighbours k, it relies on the premise that points belonging to the same class tend to lie close to one another. To classify a new point, the algorithm finds the k training points nearest to it and assigns the class that is most common among those neighbours.
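To make the voting idea concrete, here is a minimal from-scratch sketch, assuming Euclidean distance and simple majority voting; the function name and toy data are illustrative and not part of scikit-learn:
import numpy as np
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    # Distance from the query point to every training point
    dists = np.linalg.norm(train_points - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among the k nearest labels
    return Counter(train_labels[nearest]).most_common(1)[0][0]

# Toy example: two small clusters, one per class
points = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
labels = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(points, labels, np.array([4.5, 5.5])))  # prints 1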
Import Preliminaries¶
%matplotlib inline
%config InlineBackend.figure_format='retina'
# Import modules
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import os
import pandas as pd
import sklearn
import seaborn as sns
import warnings
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# Set pandas options
pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 30)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# Set plotting options
mpl.rcParams['figure.figsize'] = (9.0, 3.0)
# Set warning options
warnings.filterwarnings('ignore');
Import Data¶
# Import data
breast_cancer = load_breast_cancer()
X, y = breast_cancer.data, breast_cancer.target
# Train-test split the data
train_x, test_x, train_y, test_y = train_test_split(X, y)
# View the training dataframe
pd.DataFrame(train_x, columns=breast_cancer['feature_names']).head(5)
Targets¶
# Plot a bar plot of the target classes
pd.Series(train_y).value_counts().plot.barh(grid=False, color=['#3F8CD5','#73DEB9'], width=0.25, edgecolor='w')
plt.title('Target Outcomes')
plt.ylabel('Class')
plt.xlabel('Number of Samples');
Fit the Model¶
# Fit the initial knn model
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(train_x, train_y);
Model Evaluation¶
Cross Validation Score¶
# View the cross validation score of the initial model
scores = cross_val_score(knn_model, train_x, train_y, cv=10,
scoring='accuracy')
print(f'Cross Validation Score: {scores.mean():.5f}')
Confusion Matrix¶
# Training confusion matrix: rows are actual classes, columns are predictions
cmatrix = pd.DataFrame(confusion_matrix(train_y, knn_model.predict(train_x)))
cmatrix.index.name = 'class'
# Tag the row index as 'actual' and nest it under a 'result' level
cmatrix['result'] = 'actual'
cmatrix.set_index('result', append=True, inplace=True)
cmatrix = cmatrix.reorder_levels(['result', 'class'])
# Stack and unstack so the columns are labelled as predictions
cmatrix = cmatrix.stack()
cmatrix = pd.DataFrame(cmatrix)
cmatrix.columns = ['prediction']
cmatrix.unstack()
Parameter Tuning¶
# Define parameter range and score lists
neighbors_range = np.arange(1, 30)
train_score = []
test_score = []
# Train a knn_model for every neighbour value in our range
for k in neighbors_range:
    knn_model = KNeighborsClassifier(n_neighbors=k).fit(train_x, train_y)
    train_score.append(cross_val_score(knn_model, train_x, train_y, cv=10, scoring='accuracy').mean())
    test_score.append(cross_val_score(knn_model, test_x, test_y, cv=10, scoring='accuracy').mean())
# Plot our results
mpl.rcParams['figure.figsize'] = (9.0, 6.0)
plt.plot(neighbors_range, train_score, label='Train', linewidth=2, color='#3C8AD4')
plt.plot(neighbors_range, test_score, label='Test', linewidth=2, linestyle='--', color='#71DEB8')
plt.legend()
plt.title('KNN Model Accuracy by Number of Neighbours')
plt.xlabel('Number of Neighbours (k)')
plt.ylabel('Accuracy');
Final Model¶
# Fit the final knn model
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(train_x, train_y);
scores = cross_val_score(knn_model, train_x, train_y, cv=10,
scoring='accuracy')
print(f'Cross Validation Score: {scores.mean():.5f}')
Predict Results¶
# View some predictions
pd.Series(knn_model.predict(test_x)).head(n=7)
General Notes¶
- Good baseline model
- The algorithm finds the closest points in the training dataset (its neighbours).
- At a minimum, use 1 + the number of target classes in your dataset as your value of k
- The algorithm assigns labels by voting: it computes the distance from the query point to every training point, takes the k nearest, and assigns the class that is most frequent among those neighbours (the from-scratch sketch in the introduction illustrates this).
- The algorithm is computationally intensive during the prediction stage of the model, since every prediction requires distances to the whole training set
- A small number of neighbours often works best
- The algorithm often does not work well on data with high dimensionality or high sparsity (many zero-valued features); the short sketch below shows why
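As a rough illustration of that last point, this minimal numpy sketch (toy random data, not the breast cancer dataset above) compares the nearest and farthest distances from a query point as dimensionality grows; the closer the ratio gets to 1, the less meaningful "nearest" becomes:
import numpy as np

np.random.seed(0)
for d in [2, 10, 100, 1000]:
    # 1,000 random points and one random query point in d dimensions
    points = np.random.rand(1000, d)
    query = np.random.rand(d)
    dists = np.linalg.norm(points - query, axis=1)
    # A ratio near 1 means the nearest and farthest points are
    # almost the same distance away from the query
    print(f'd={d:5d}  nearest/farthest distance ratio: {dists.min() / dists.max():.3f}')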
Author: Kavi Sekhon