Commit 0a067f59 authored by ashepley
%% Cell type:markdown id: tags:
### Non-linear Support Vector Machines
You can obtain the dataset used in this lecture here: https://www.kaggle.com/rakeshrau/social-network-ads/data. It is a 'categorical dataset to determine whether a user purchased a particular product'.
%% Cell type:code id: tags:
``` python
# imports (explicit names instead of wildcard imports)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
```
%% Cell type:markdown id: tags:
Here are the functions you used in the previous lectures
%% Cell type:code id: tags:
``` python
def load_data(DATASET_PATH):
    return pd.read_csv(DATASET_PATH)

def check_NaN(dataframe):
    print("Total NaN:", dataframe.isnull().values.sum())
    print("NaN by column:\n", dataframe.isnull().sum())
    return

def one_hot_encode(dataframe, col_name):
    dataframe = pd.get_dummies(dataframe, columns=[col_name], prefix=[col_name])
    return dataframe
```
%% Cell type:markdown id: tags:
Load the dataset, and have a look at its contents
%% Cell type:code id: tags:
``` python
DATASET_PATH = './datasets/Social_Network_Ads.csv'
```
%% Cell type:code id: tags:
``` python
# load the CSV into a pandas DataFrame
ads = load_data(DATASET_PATH)
ads.head()
```
%% Cell type:markdown id: tags:
We'll be training our SVM on the Age and Estimated Salary features, to predict whether the user purchased a product based on an ad or not.
%% Cell type:code id: tags:
``` python
chosen_columns = ['Age','EstimatedSalary','Purchased']
subset = ads.filter(chosen_columns)
subset.head()
```
%% Cell type:markdown id: tags:
Always check whether your subset contains NaN values
%% Cell type:code id: tags:
``` python
check_NaN(subset)
```
%% Cell type:markdown id: tags:
Split the dataset into train and test sets
%% Cell type:code id: tags:
``` python
x_train, x_test, y_train, y_test = train_test_split(
    subset.drop(['Purchased'], axis=1),
    subset['Purchased'],
    test_size=0.2,
    random_state=42)
print("x train/test ",x_train.shape, x_test.shape)
print("y train/test ",y_train.shape, y_test.shape)
```
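%% Cell type:markdown id: tags:
If the two classes are imbalanced, a plain random split can leave the test set with a different class ratio from the training set. The sketch below uses a small made-up label array (an assumption for illustration, not the ads data) to show how the `stratify` argument of `train_test_split` keeps the ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 30% of them in the positive class.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 30 + [0] * 70)

# stratify=y_toy asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

Whether stratifying matters for the ads data depends on how balanced `Purchased` is, which you can check with `subset['Purchased'].value_counts()`.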
%% Cell type:code id: tags:
``` python
x_dev = x_train.values
y_dev = y_train.values
x_t = x_test.values
y_t = y_test.values
```
%% Cell type:markdown id: tags:
SVMs are sensitive to the scale of the input features, so the data should be standardised (zero mean, unit variance per feature) before training. Learn more here:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
%% Cell type:code id: tags:
``` python
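# feature scaling
sc = StandardScaler()
# learn the mean and standard deviation from the training data only
x_dev = sc.fit_transform(x_dev)
# reuse the training statistics on the test data (do not refit on it)
x_t = sc.transform(x_t)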
```
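%% Cell type:markdown id: tags:
One way to make it hard to fit the scaler on the wrong data is to bundle the scaler and the classifier into a single scikit-learn pipeline. The following is a minimal sketch on randomly generated stand-in data; the `*_toy` names and the values are assumptions for illustration, not part of this lecture's dataset.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical stand-in data: two features on very different scales.
X_train_toy = np.column_stack([rng.normal(40, 10, 80), rng.normal(70000, 30000, 80)])
y_train_toy = (X_train_toy[:, 0] > 40).astype(int)
X_test_toy = np.column_stack([rng.normal(40, 10, 20), rng.normal(70000, 30000, 20)])

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train_toy, y_train_toy)          # scaler statistics come from training rows only
toy_predictions = model.predict(X_test_toy)  # test rows are scaled with those same statistics

scaler = model.named_steps["standardscaler"]
print("means learned from the training data:", scaler.mean_)
```

Calling `fit` on the pipeline fits the scaler and the SVM together, and `predict` only applies the already-fitted scaler, so the test data never influences the scaling.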
%% Cell type:markdown id: tags:
#### SVM Classifier
Create the SVM, and train it on the standardised data
### Parameters for SVC: Gamma and C
A lower value of gamma fits the training data loosely and gives a smoother decision boundary, whereas a higher value of gamma fits the training data very closely and can result in over-fitting.
The C parameter controls regularisation. A smaller value of C regularises more strongly and gives a larger-margin hyperplane that tolerates some misclassified training points; a larger value of C penalises misclassifications more heavily and gives a smaller-margin hyperplane.
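
To get a feel for how gamma and C change the fit, it can help to train several classifiers and compare training and test accuracy. The sketch below does this on synthetic `make_moons` data; the dataset, the value grids and the variable names are assumptions chosen for illustration, not values recommended for the ads data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data, used here only for illustration.
X_demo, y_demo = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

results = {}
for gamma in (0.01, 1, 100):
    for C in (0.1, 10):
        clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X_tr, y_tr)
        train_acc = clf.score(X_tr, y_tr)
        test_acc = clf.score(X_te, y_te)
        results[(gamma, C)] = (train_acc, test_acc)
        print(f"gamma={gamma:<6} C={C:<4} train={train_acc:.2f} test={test_acc:.2f}")
```

A training accuracy that is much higher than the test accuracy for a given pair of values is the usual sign of over-fitting.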
%% Cell type:code id: tags:
``` python
svm_classifier = SVC(kernel='rbf', random_state=0)  # default gamma and C; try e.g. gamma=0.001, C=10
svm_classifier.fit(x_dev, y_dev)
```
%% Cell type:markdown id: tags:
#### Inference
%% Cell type:markdown id: tags:
Pass in the test set to see how well the classifier performs
%% Cell type:code id: tags:
``` python
predictions = svm_classifier.predict(x_t)
```
%% Cell type:markdown id: tags:
The confusion matrix shows how many samples of each class were correctly and incorrectly classified. In scikit-learn, rows correspond to the true class and columns to the predicted class.
%% Cell type:code id: tags:
``` python
confusion_matrix(y_t, predictions, labels = [0,1])
```
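%% Cell type:markdown id: tags:
The four cells of a binary confusion matrix are enough to compute accuracy, precision and recall by hand. A minimal sketch on made-up labels (the `*_toy` arrays are hypothetical and not taken from the ads data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only.
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_toy = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# With labels=[0, 1], rows are true classes and columns are predicted classes,
# so ravel() yields: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("accuracy, precision, recall:", accuracy, precision, recall)
```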
%% Cell type:markdown id: tags:
#### Evaluation
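Evaluate the classifier using mean squared error, accuracy, precision and recall. Because the labels are 0 and 1, the mean squared error here is simply the fraction of misclassified test samples.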
%% Cell type:code id: tags:
``` python
#mean squared error
np.mean((predictions - y_t) ** 2)
```
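%% Cell type:markdown id: tags:
For 0/1 labels the squared difference between a prediction and its label is 1 exactly when the prediction is wrong, so the mean squared error coincides with 1 minus the accuracy. A small check on made-up labels (the arrays are hypothetical):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, for illustration only.
y_true_toy = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred_toy = np.array([0, 1, 0, 0, 1, 1, 0, 1])

mse = np.mean((y_pred_toy - y_true_toy) ** 2)
error_rate = 1 - accuracy_score(y_true_toy, y_pred_toy)

print("MSE:", mse)
print("1 - accuracy:", error_rate)
```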
%% Cell type:code id: tags:
``` python
print("Accuracy:",str(metrics.accuracy_score(y_t, predictions)*100)+"%")
print("Precision:",str(round(metrics.precision_score(y_t, predictions)*100))+"%")
print("Recall:",str(round(metrics.recall_score(y_t, predictions)*100))+"%")
```
%% Cell type:markdown id: tags:
#### Visualisation
%% Cell type:code id: tags:
``` python
xs, ys = x_t, y_t
cmap = ListedColormap(('orange', 'grey'))
# grid covering the standardised feature range, padded by 1 on each side
X1, X2 = np.meshgrid(
    np.arange(start=xs[:, 0].min() - 1, stop=xs[:, 0].max() + 1, step=0.01),
    np.arange(start=xs[:, 1].min() - 1, stop=xs[:, 1].max() + 1, step=0.01))
# colour every grid point by its predicted class to show the decision regions
plt.contourf(X1, X2,
             svm_classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=cmap)
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# overlay the test points, one colour per true class
for i, j in enumerate(np.unique(ys)):
    plt.scatter(xs[ys == j, 0], xs[ys == j, 1],
                color=cmap(i), label=j)
plt.title('Test Set')
plt.xlabel('Age (standardised)')
plt.ylabel('Estimated Salary (standardised)')
plt.legend()
plt.show()
```
%% Cell type:markdown id: tags:
### Exercise
Add the feature 'Gender' to the training set, and see if the accuracy improves and the mean squared error drops!
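
As a starting hint, 'Gender' is a text column, so it has to be converted to numbers before an SVM can use it. The sketch below applies the same `pd.get_dummies` call that `one_hot_encode` wraps to a tiny made-up frame; the rows are hypothetical and only mimic the column names of the ads data.

```python
import pandas as pd

# Hypothetical rows that reuse the dataset's column names; not real data.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [22, 31, 45, 52],
    "EstimatedSalary": [30000, 52000, 61000, 88000],
    "Purchased": [0, 0, 1, 1],
})

# Replace the text column with one indicator column per category.
encoded = pd.get_dummies(toy, columns=["Gender"], prefix=["Gender"])
print(encoded.columns.tolist())
```

The two indicator columns carry the same information, so keeping only one of them is enough.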
%% Cell type:code id: tags:
``` python
```