%% Cell type:markdown id: tags:
### Non-linear Support Vector Machines
You can obtain the dataset used in this lecture here: https://www.kaggle.com/rakeshrau/social-network-ads/data. It is a 'categorical dataset to determine whether a user purchased a particular product'.
%% Cell type:code id: tags:
``` python
# imports used throughout this notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
```
%% Cell type:markdown id: tags:
Here are the helper functions you used in the previous lectures.
%% Cell type:code id: tags:
``` python
def load_data(DATASET_PATH):
    return pd.read_csv(DATASET_PATH)

def check_NaN(dataframe):
    print("Total NaN:", dataframe.isnull().values.sum())
    print("NaN by column:\n", dataframe.isnull().sum())

def one_hot_encode(dataframe, col_name):
    dataframe = pd.get_dummies(dataframe, columns=[col_name], prefix=[col_name])
    return dataframe
```
%% Cell type:markdown id: tags:
Load the dataset and have a look at its contents.
%% Cell type:code id: tags:
``` python
DATASET_PATH = './datasets/Social_Network_Ads.csv'
```
%% Cell type:code id: tags:
``` python
#create pandas object
ads = load_data(DATASET_PATH)
ads.head()
```
%% Cell type:markdown id: tags:
We'll train the SVM on the Age and EstimatedSalary features to predict whether the user purchased the advertised product.
%% Cell type:code id: tags:
``` python
chosen_columns = ['Age','EstimatedSalary','Purchased']
subset = ads.filter(chosen_columns)
subset.head()
```
%% Cell type:markdown id: tags:
Always check whether your subset contains NaN values.
%% Cell type:code id: tags:
``` python
check_NaN(subset)
```
%% Cell type:markdown id: tags:
Split the dataset into train and test sets
%% Cell type:code id: tags:
``` python
x_train, x_test, y_train, y_test = train_test_split(subset.drop(['Purchased'], axis=1),
                                                    subset['Purchased'],
                                                    test_size=0.2, random_state=42)
print("x train/test ", x_train.shape, x_test.shape)
print("y train/test ", y_train.shape, y_test.shape)
```
%% Cell type:code id: tags:
``` python
# work with NumPy arrays: dev (training) arrays and held-out test arrays
x_dev = x_train.values
y_dev = y_train.values
x_t = x_test.values
y_t = y_test.values
```
%% Cell type:markdown id: tags:
Normalisation of data is expected when using SVMs. Learn more here:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
%% Cell type:code id: tags:
``` python
# feature scaling: fit the scaler on the training data only,
# then apply the same transformation to the test data
sc = StandardScaler()
x_dev = sc.fit_transform(x_dev)
x_t = sc.transform(x_t)
```
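%% Cell type:markdown id: tags:
As a quick sanity check (not part of the original workflow), you can confirm that the standardised training features now have roughly zero mean and unit variance:
%% Cell type:code id: tags:
``` python
# sanity check: standardised features should have mean ~0 and std ~1
print("means:", x_dev.mean(axis=0).round(3))
print("stds: ", x_dev.std(axis=0).round(3))
```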
%% Cell type:markdown id: tags:
#### SVM Classifier
Create the SVM and train it on the standardised data.
#### Parameters for SVC: Gamma and C
A lower value of gamma gives a smoother decision boundary that fits the training data loosely, whereas a higher value of gamma fits the training data very closely and can lead to over-fitting.
The C parameter controls regularisation. A smaller value of C gives a wider margin that tolerates more misclassified training points, while a larger value of C gives a narrower margin that fits the training data more closely.
%% Cell type:code id: tags:
``` python
svm_classifier = SVC(kernel='rbf', random_state=0)  # try e.g. gamma=0.001, C=10
svm_classifier.fit(x_dev, y_dev)
```
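%% Cell type:markdown id: tags:
The comment above hints at tuning gamma and C. One way to explore them (a sketch, not part of the original lecture; the candidate values are purely illustrative) is a small cross-validated grid search on the training data:
%% Cell type:code id: tags:
``` python
# sketch: cross-validated grid search over gamma and C (illustrative values)
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
grid.fit(x_dev, y_dev)
print("Best parameters:", grid.best_params_)
print("Best cross-validation accuracy:", grid.best_score_)
```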
%% Cell type:markdown id: tags:
#### Inference
%% Cell type:markdown id: tags:
Pass in the test set to see how well the model performs.
%% Cell type:code id: tags:
``` python
predictions = svm_classifier.predict(x_t)
```
%% Cell type:markdown id: tags:
The confusion matrix shows how many samples of each class were classified correctly and incorrectly.
%% Cell type:code id: tags:
``` python
confusion_matrix(y_t, predictions, labels = [0,1])
```
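%% Cell type:markdown id: tags:
If you prefer labelled output, you can wrap the matrix in a DataFrame (purely a readability convenience, not part of the original notebook):
%% Cell type:code id: tags:
``` python
# label the confusion matrix rows and columns for readability
cm = confusion_matrix(y_t, predictions, labels=[0, 1])
pd.DataFrame(cm,
             index=['actual: 0', 'actual: 1'],
             columns=['predicted: 0', 'predicted: 1'])
```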
%% Cell type:markdown id: tags:
#### Evaluation
Evaluate the model using mean squared error, accuracy, precision, and recall.
%% Cell type:code id: tags:
``` python
# mean squared error; for 0/1 labels this equals the misclassification rate
np.mean((predictions - y_t) ** 2)
```
%% Cell type:code id: tags:
``` python
print("Accuracy:",str(metrics.accuracy_score(y_t, predictions)*100)+"%")
print("Precision:",str(round(metrics.precision_score(y_t, predictions)*100))+"%")
print("Recall:",str(round(metrics.recall_score(y_t, predictions)*100))+"%")
```
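%% Cell type:markdown id: tags:
scikit-learn can also summarise per-class precision, recall and F1 in a single call; this is an optional extra, not part of the original lecture:
%% Cell type:code id: tags:
``` python
# per-class precision, recall and F1 in one summary table
print(metrics.classification_report(y_t, predictions,
                                    target_names=['not purchased', 'purchased']))
```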
%% Cell type:markdown id: tags:
#### Visualisation
%% Cell type:code id: tags:
``` python
# plot the decision regions of the trained classifier over the (scaled) test set
xs, ys = x_t, y_t
X1, X2 = np.meshgrid(np.arange(start=xs[:, 0].min() - 1, stop=xs[:, 0].max() + 1, step=0.01),
                     np.arange(start=xs[:, 1].min() - 1, stop=xs[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, svm_classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('orange', 'grey')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(ys)):
    plt.scatter(xs[ys == j, 0], xs[ys == j, 1],
                c=ListedColormap(('orange', 'grey'))(i), label=j)
plt.title('Test Set')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
```
%% Cell type:markdown id: tags:
### Exercise
Add the 'Gender' feature to the training set (remember that it is categorical and needs to be one-hot encoded) and see whether the accuracy improves and the mean squared error drops! A possible starting point is sketched below.
%% Cell type:code id: tags:
``` python
```
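%% Cell type:markdown id: tags:
One possible starting point (a sketch only, reusing the one_hot_encode helper defined above and the CSV's 'Gender' column):
%% Cell type:code id: tags:
``` python
# sketch: one-hot encode 'Gender', then repeat the split / scale / train / evaluate steps
subset_g = ads.filter(['Gender', 'Age', 'EstimatedSalary', 'Purchased'])
subset_g = one_hot_encode(subset_g, 'Gender')
x_tr, x_te, y_tr, y_te = train_test_split(subset_g.drop(['Purchased'], axis=1),
                                          subset_g['Purchased'],
                                          test_size=0.2, random_state=42)
sc_g = StandardScaler()
x_tr = sc_g.fit_transform(x_tr)
x_te = sc_g.transform(x_te)
svm_g = SVC(kernel='rbf', random_state=0)
svm_g.fit(x_tr, y_tr)
print("Accuracy with Gender:", metrics.accuracy_score(y_te, svm_g.predict(x_te)))
```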