Commit 7c34c7a7 authored by ashepley's avatar ashepley
Upload New File

parent 0a067f59
%% Cell type:markdown id: tags:
## K-Nearest Neighbors (kNN) Classifier
* "Tell me who your neighbours are, and I'll tell you who you are!"
* K-Nearest Neighbors (kNN) classifies a data point on this principle: it assigns the point the most common label among the data points closest to it
* It uses the Euclidean distance (the scikit-learn default) between the data point and the other data points to decide which ones are its neighbours
* The k is the number of neighbours that take part in the vote
* kNN is well suited to classification tasks where the relationships between the features are complex and hard to understand
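
To make the idea concrete, here is a minimal from-scratch sketch of the distance-then-vote procedure described above. The helper `knn_predict` and the toy arrays `X_demo` and `y_demo` are hypothetical names invented for this illustration; this is not how scikit-learn implements the classifier internally.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data made up for illustration: class 0 near the origin, class 1 further out
X_demo = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
                   [5.0, 5.0], [6.0, 5.5], [5.5, 6.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_demo, y_demo, np.array([0.8, 0.8]), k=3))
print(knn_predict(X_demo, y_demo, np.array([5.2, 5.4]), k=3))
```

With an odd k and two classes the vote can never tie, which is one reason odd values of k are a common choice for binary problems.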
%% Cell type:code id: tags:
``` python
import numpy as np
from matplotlib import pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
```
%% Cell type:code id: tags:
``` python
xBlue = np.array([0.3,0.5,1,1.4,1.7,2])
yBlue = np.array([1,4.5,2.3,1.9,8.9,4.1])
```
%% Cell type:code id: tags:
``` python
xRed = np.array([3.3,3.5,4,4.4,5.7,6])
yRed = np.array([7,1.5,6.3,1.9,2.9,7.1])
```
%% Cell type:code id: tags:
``` python
X = np.array([[0.3,1],[0.5,4.5],[1,2.3],[1.4,1.9],[1.7,8.9],[2,4.1],[3.3,7],[3.5,1.5],[4,6.3],[4.4,1.9],[5.7,2.9],[6,7.1]])
y = np.array([0,0,0,0,0,0,1,1,1,1,1,1]) # 0: blue class, 1: red class
```
%% Cell type:code id: tags:
``` python
# Use the plain 'o' marker; the colour comes from the color keyword
plt.plot(xBlue, yBlue, 'o', color='blue')
plt.plot(xRed, yRed, 'o', color='red')
plt.plot(2, 8, 'o', color='green', markersize=10)  # the new point to classify
plt.axis([-0.5, 10, -0.5, 10])
```
%% Cell type:code id: tags:
``` python
classifier = KNeighborsClassifier(n_neighbors=3) # this is the k value
classifier.fit(X, y)
pred = classifier.predict(np.array([[2, 8]]))  # returns an array with one label
if pred[0] == 0:
    print("Data point is blue")
else:
    print("Data point is red")
```
%% Cell type:markdown id: tags:
### Let's apply this to 'real' data!
You can obtain the dataset used in this lecture here: https://www.kaggle.com/rakeshrau/social-network-ads/data. It is a 'categorical dataset to determine whether a user purchased a particular product'.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
def check_NaN(dataframe):
    """Print the total number of missing values and the count per column."""
    print("Total NaN:", dataframe.isnull().values.sum())
    print("NaN by column:\n", dataframe.isnull().sum())
```
%% Cell type:code id: tags:
``` python
# Forward slashes avoid invalid escape sequences such as "\d" and work on every platform
data = pd.read_csv("./datasets/Social_Network_Ads.csv")
```
%% Cell type:code id: tags:
``` python
data.head()
```
%% Cell type:code id: tags:
``` python
check_NaN(data)
```
%% Cell type:markdown id: tags:
This is another way of selecting the features you want: data.features holds your x values and data.target holds your y values. They are a pandas DataFrame and a pandas Series, which scikit-learn accepts directly, so there is no need to convert them as we did before!
%% Cell type:code id: tags:
``` python
data.features = data[["EstimatedSalary","Age"]]
data.target = data.Purchased
```
%% Cell type:markdown id: tags:
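Here is the normalisation step. It is very important for kNN because the algorithm compares points by distance: without rescaling, a feature with a large numeric range (such as EstimatedSalary) would dominate a feature with a small range (such as Age). MinMaxScaler rescales every feature to the range 0 to 1.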
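
The effect of scaling on distances can be checked with a small worked example. The three rows below are made-up values in the order [EstimatedSalary, Age], chosen only to illustrate the point; they are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows in the order [EstimatedSalary, Age]; the values are made up
people = np.array([
    [20000.0, 20.0],
    [21000.0, 60.0],   # similar salary to row 0, very different age
    [150000.0, 20.0],  # same age as row 0, very different salary
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Distances in the original units: salary differences swamp age differences
raw_01 = euclidean(people[0], people[1])
raw_02 = euclidean(people[0], people[2])

# Distances after rescaling each column to [0, 1]
scaled = MinMaxScaler().fit_transform(people)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

print(f"raw:    d(0,1) = {raw_01:.1f}, d(0,2) = {raw_02:.1f}")
print(f"scaled: d(0,1) = {scaled_01:.3f}, d(0,2) = {scaled_02:.3f}")
```

In raw units the 40-year age gap barely registers next to the salary gap, whereas after scaling both differences contribute on a comparable footing.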
%% Cell type:code id: tags:
``` python
data.features = preprocessing.MinMaxScaler().fit_transform(data.features)
```
%% Cell type:markdown id: tags:
#### kNN is a Lazy Learner!
* Although we define train/test splits, there is no explicit training step
* Lazy learning means the model does not generalise from the training data ahead of time; the work is deferred until a prediction is requested
* kNN keeps all the training data for inference
* Making the predictions on the test data is rather slow (because the distance between a test data point and the training data points must be calculated)
* kNN is non-parametric: it makes no assumptions about the data distribution, so there are no model parameters to learn from the data
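
A small sketch of what "lazy" means in practice, using the `kneighbors` method of a fitted `KNeighborsClassifier` to show the neighbour lookup that happens at prediction time. The arrays `X_demo` and `y_demo` and the query point are made up for this illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data made up for illustration
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                   [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_demo, y_demo)  # keeps the training data; no weights are learned here

query = np.array([[0.5, 0.5]])
# The neighbour search runs only now, when a prediction is requested
distances, indices = clf.kneighbors(query)
neighbour_labels = y_demo[indices[0]]

print("neighbour indices:", indices[0])
print("neighbour distances:", distances[0])
print("neighbour labels:", neighbour_labels)
print("prediction:", clf.predict(query)[0])
```

The predicted label is simply the majority among `neighbour_labels`, which is why the full training set has to stay available after `fit`.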
%% Cell type:code id: tags:
``` python
x_train, x_test, y_train, y_test = train_test_split(data.features,data.target, test_size=0.2,random_state=42)
```
%% Cell type:markdown id: tags:
Here, we find the optimal k value by scoring each candidate k with cross_val_score, using 10-fold cross-validation, and keeping the k with the highest mean accuracy. You can find more info on this function here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
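
As a hedged variation on this search, the sketch below runs the same kind of loop on synthetic stand-in data from `make_classification` (not the Social Network Ads dataset). It wraps the scaler and the classifier in a pipeline so that the scaling is refitted inside each fold, and it keeps an explicit list of k values so the best index maps back to the right k. The names `X_demo`, `y_demo`, `k_values` and `best_k` are assumptions of this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data, generated only for this illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=2,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)

k_values = list(range(1, 31))
mean_scores = []
for k in k_values:
    # The pipeline refits the scaler on the training part of every fold
    model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    mean_scores.append(scores.mean())

# argmax gives a position in the list, so look the k value up explicitly
best_k = k_values[int(np.argmax(mean_scores))]
print("Best k:", best_k, "with mean accuracy:", max(mean_scores))
```

Looking the value up in `k_values` avoids confusing the zero-based position returned by `np.argmax` with the k it stands for.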
%% Cell type:code id: tags:
``` python
k_scores = []
for k in range(1, 100):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, data.features, data.target, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())
# k_scores[0] belongs to k=1, so add 1 to turn the best index into a k value
optimal_k = int(np.argmax(k_scores)) + 1
print("Optimal k with cross-validation: ", optimal_k)
```
%% Cell type:code id: tags:
``` python
classifier = KNeighborsClassifier(optimal_k)
classifier.fit(x_train, y_train)
predictions = classifier.predict(x_test)
```
%% Cell type:markdown id: tags:
#### Evaluation using a confusion matrix, accuracy and mean squared error
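
For binary 0/1 labels these three measures are closely related: accuracy is the diagonal of the confusion matrix divided by the total count, and the mean squared error equals 1 minus the accuracy. The sketch below checks this on two small hand-made label arrays, `y_true` and `y_pred`, which are invented for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels for illustration: 10 samples, 2 of them misclassified
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
acc = accuracy_score(y_true, y_pred)
mse = np.mean((y_pred - y_true) ** 2)  # each error contributes (±1)^2 = 1

print(cm)
print("accuracy from sklearn:  ", acc)
print("accuracy from the matrix:", np.trace(cm) / cm.sum())
print("mean squared error:      ", mse)
```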
%% Cell type:code id: tags:
``` python
print(confusion_matrix(y_test, predictions))
```
%% Cell type:code id: tags:
``` python
print("Accuracy:",str(accuracy_score(y_test, predictions)*100)+"%")
```
%% Cell type:code id: tags:
``` python
# mean squared error; for 0/1 labels this is the fraction of misclassified points
print("Mean squared error: ",str(np.mean((predictions - y_test) ** 2)*100)+"%")
```
%% Cell type:markdown id: tags:
### Visualisation
%% Cell type:code id: tags:
``` python
from matplotlib.colors import ListedColormap

xs, ys = np.asarray(x_test), np.asarray(y_test)
cmap = ListedColormap(('orange', 'grey'))
# Grid covering the feature space; the classifier is evaluated at every grid point
X1, X2 = np.meshgrid(np.arange(start=xs[:, 0].min() - 1, stop=xs[:, 0].max() + 1, step=0.01),
                     np.arange(start=xs[:, 1].min() - 1, stop=xs[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=cmap)
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# Overlay the test points, one colour per class
for i, j in enumerate(np.unique(ys)):
    plt.scatter(xs[ys == j, 0], xs[ys == j, 1],
                color=cmap(i), label=j)
plt.title('Test Set')  # the points plotted are x_test / y_test
plt.xlabel('Estimated Salary')
plt.ylabel('Age')
plt.legend()
plt.show()
```
%% Cell type:markdown id: tags:
### Exercise
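Add the feature 'Gender' to the training set and see whether the model improves. Gender is a categorical column, so it has to be encoded as a number before kNN can use it in a distance calculation.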
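
As a starting hint, here is one possible way to turn a text column into a numeric one with pandas. The small DataFrame `demo`, its values, the category names "Male" and "Female", and the new column name `GenderCode` are all assumptions made up for this sketch; check the actual values in your data with `data.Gender.unique()` first.

```python
import pandas as pd

# Hypothetical rows imitating the dataset's columns; the values are made up
demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 35, 26, 47],
    "EstimatedSalary": [19000, 20000, 43000, 25000],
})

# Map each assumed category to a number so it can enter a distance calculation
demo["GenderCode"] = demo["Gender"].map({"Female": 0, "Male": 1})

# Select the three numeric columns as features
features = demo[["EstimatedSalary", "Age", "GenderCode"]]
print(features)
```

The encoded column already lies between 0 and 1, so it sits on the same scale as the other features once they have been normalised.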
%% Cell type:code id: tags:
``` python
```