"* Tell me who your neighbours are, and I'll tell you who you are!\n",
"* K-Nearest Neighbors (kNN) is an algorithm that classifies a data point based on this principle. This means it classifies a given data point according to the classification labels of surrounding data points\n",
"* It uses Euclidean distance between the data point and other data points to determine who are its neighbours\n",
"* The k represents the number of neighbours to include in the classification problem\n",
"* kNN is well suited for classification tasks where the relationship between the features are complex and hard to understand."
"classifier = KNeighborsClassifier(n_neighbors=3) # this is the k value\n",
"classifier.fit(X,y)\n",
"\n",
"pred = classifier.predict(np.array([[2,8]]))\n",
"\n",
"if pred == 0:\n",
" print(\"Data point is blue\")\n",
"else:\n",
" print(\"Data point is red\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Let's apply this to 'real' data!\n",
"You can obtain the dataset used in this lecture here: https://www.kaggle.com/rakeshrau/social-network-ads/data. It is a 'categorical dataset to determine whether a user purchased a particular product'."
"This is another way of selecting the features you want where data.features represents your x values, and data.target represents your y values. These are numpy arrays so there is no need to convert them as we did before!"
"* Although we define train/test splits, there is no explicity training\n",
"* Lazy Learning means there is no explicit training, and no generalisation based on the training data\n",
"* kNN keeps all the training data for inference\n",
"* Making the predictions on the test data is rather slow (because the distance between a test data point and all the training data points must be calculated)\n",
"* kNN uses non-parametric learning: no parameters are to be learned about the data because it makes no assumptions about data distribution"
"Here, we are finding the optimal k value by using cross_val_score. You can find more info on this function here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html"
"Add the feature 'Gender' to the training set, and see if the model improves?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
%% Cell type:markdown id: tags:
## K-Nearest Neighbors (kNN) Classifier
* Tell me who your neighbours are, and I'll tell you who you are!
* K-Nearest Neighbors (kNN) is an algorithm that classifies a data point based on this principle: it assigns a data point a label according to the classification labels of the data points surrounding it
* It uses the Euclidean distance between the data point and the other data points to determine which points are its neighbours (the sketch after this list spells this rule out)
* The k is the number of neighbours taken into account when classifying a point
* kNN is well suited to classification tasks where the relationships between the features are complex and hard to understand.
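To make the rule concrete, here is a minimal, self-contained sketch of kNN by hand: measure the Euclidean distance from a query point to every labelled point, keep the k closest, and take a majority vote. The toy coordinates below are made up purely for illustration.
%% Cell type:code id: tags:
``` python
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbours
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data: class 0 clustered near the origin, class 1 near (5, 5)
X_toy = np.array([[1, 1], [1, 2], [2, 1], [5, 5], [6, 5], [5, 6]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([1.5, 1.5])))  # -> 0
```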
%% Cell type:code id: tags:
``` python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # 0: blue class, 1: red class
```
%% Cell type:code id: tags:
``` python
# Plot the two classes and the new (green) point we want to classify
plt.plot(xBlue, yBlue, 'o', color='blue')
plt.plot(xRed, yRed, 'o', color='red')
plt.plot(2, 8, 'o', color='green', markersize=10)
plt.axis([-0.5, 10, -0.5, 10])
plt.show()
```
%% Cell type:code id: tags:
``` python
classifier = KNeighborsClassifier(n_neighbors=3)  # this is the k value
classifier.fit(X, y)

# Classify the green point at (2, 8)
pred = classifier.predict(np.array([[2, 8]]))

if pred[0] == 0:
    print("Data point is blue")
else:
    print("Data point is red")
```
%% Cell type:markdown id: tags:
### Let's apply this to 'real' data!
You can obtain the dataset used in this lecture here: https://www.kaggle.com/rakeshrau/social-network-ads/data. It is a 'categorical dataset to determine whether a user purchased a particular product'.
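A minimal sketch of loading it with pandas is below; the file name is an assumption, so use whatever the downloaded CSV is actually called.
%% Cell type:code id: tags:
``` python
import pandas as pd

# Load the 'Social Network Ads' data; the file name here is assumed
data = pd.read_csv("Social_Network_Ads.csv")
data.head()
```
%% Cell type:markdown id: tags: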
This is another way of selecting the features you want: data.features holds your x values and data.target holds your y values. scikit-learn accepts these directly, so there is no need to convert them to numpy arrays as we did before!
%% Cell type:code id: tags:
``` python
# Two feature columns (x values) and the binary purchase label (y values)
data.features = data[["EstimatedSalary", "Age"]]
data.target = data.Purchased
```
%% Cell type:markdown id: tags:
Here is the normalisation step. This is very important, as we don't want huge or tiny values to obscure the meaning of the data.
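Below is a minimal sketch of such a step, assuming scikit-learn's MinMaxScaler; the scaler choice and variable name are assumptions rather than necessarily what the lecture used.
%% Cell type:code id: tags:
``` python
from sklearn.preprocessing import MinMaxScaler

# Rescale Age and EstimatedSalary to [0, 1] so the large salary values
# do not dominate the Euclidean distances
scaler = MinMaxScaler()
features_scaled = scaler.fit_transform(data.features)
```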
%% Cell type:markdown id: tags:
* Although we define train/test splits, there is no explicit training (see the sketch after this list)
* Lazy learning means there is no explicit training and no generalisation based on the training data
* kNN keeps all of the training data for inference
* Making predictions on the test data is therefore rather slow, because the distance between each test data point and every training data point must be calculated
* kNN uses non-parametric learning: no parameters are learned from the data, because it makes no assumptions about the data distribution
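As a sketch of that behaviour, we can split the (scaled) data and fit the classifier: fit merely stores the training data, and the distance computations only happen when we call predict or score. The variable names follow the sketches above, and the value of k here is arbitrary.
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    features_scaled, data.target, test_size=0.2, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)         # 'training' just stores the data
print(model.score(X_test, y_test))  # distances are computed here
```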
%% Cell type:markdown id: tags:
Here, we find the optimal k value using cross_val_score. You can find more info on this function here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
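A sketch of that search is below, assuming 10-fold cross-validation over k = 1..30; the range and fold count are arbitrary choices, and the variable names follow the sketches above.
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Mean cross-validated accuracy for each candidate k
k_values = list(range(1, 31))
scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k),
                          features_scaled, data.target, cv=10).mean()
          for k in k_values]

best_k = k_values[int(np.argmax(scores))]
print("Best k:", best_k)
```
%% Cell type:markdown id: tags:
Add the feature 'Gender' to the training set, and see if the model improves.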