diff --git a/topic_17/.ipynb_checkpoints/Untitled-checkpoint.ipynb b/topic_17/.ipynb_checkpoints/Untitled-checkpoint.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..35eac67af2e6e9a9ec924d04c98b18eba5d2988d --- /dev/null +++ b/topic_17/.ipynb_checkpoints/Untitled-checkpoint.ipynb @@ -0,0 +1,32 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/topic_17/.ipynb_checkpoints/topic_17_classification_demo-checkpoint.ipynb b/topic_17/.ipynb_checkpoints/topic_17_classification_demo-checkpoint.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..96fe8ccf5adb3ac5bbd73563e8e62561de99e718 --- /dev/null +++ b/topic_17/.ipynb_checkpoints/topic_17_classification_demo-checkpoint.ipynb @@ -0,0 +1,222 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Topic 17 - Classification Demonstration\n", + "\n", + "In this demonstration we will train some classifier algorithms on the MNIST data set.\n", + "\n", + "We will start by loading the dataset from the Scikit-learn library.\n", + "\n", + "**Note: This may take a long time to execute - there are 70000 images.**\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.datasets import fetch_openml\n", + "mnist = fetch_openml('mnist_784', version=1)\n", + "mnist.keys()\n", + "dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details',\n", + "'categories', 'url'])" + ] + }, + { + 
"cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "**Author**: Yann LeCun, Corinna Cortes, Christopher J.C. Burges \n", + "**Source**: [MNIST Website](http://yann.lecun.com/exdb/mnist/) - Date unknown \n", + "**Please cite**: \n", + "\n", + "The MNIST database of handwritten digits with 784 features, raw data available at: http://yann.lecun.com/exdb/mnist/. It can be split in a training set of the first 60,000 examples, and a test set of 10,000 examples \n", + "\n", + "It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field. \n", + "\n", + "With some classification methods (particularly template-based methods, such as SVM and K-nearest neighbors), the error rate improves when the digits are centered by bounding box rather than center of mass. If you do this kind of pre-processing, you should report it in your publications. The MNIST database was constructed from NIST's NIST originally designated SD-3 as their training set and SD-1 as their test set. However, SD-3 is much cleaner and easier to recognize than SD-1. The reason for this can be found on the fact that SD-3 was collected among Census Bureau employees, while SD-1 was collected among high-school students. 
Drawing sensible conclusions from learning experiments requires that the result be independent of the choice of training set and test among the complete set of samples. Therefore it was necessary to build a new database by mixing NIST's datasets. \n", + "\n", + "The MNIST training set is composed of 30,000 patterns from SD-3 and 30,000 patterns from SD-1. Our test set was composed of 5,000 patterns from SD-3 and 5,000 patterns from SD-1. The 60,000 pattern training set contained examples from approximately 250 writers. We made sure that the sets of writers of the training set and test set were disjoint. SD-1 contains 58,527 digit images written by 500 different writers. In contrast to SD-3, where blocks of data from each writer appeared in sequence, the data in SD-1 is scrambled. Writer identities for SD-1 is available and we used this information to unscramble the writers. We then split SD-1 in two: characters written by the first 250 writers went into our new training set. The remaining 250 writers were placed in our test set. Thus we had two sets with nearly 30,000 examples each. The new training set was completed with enough examples from SD-3, starting at pattern # 0, to make a full set of 60,000 training patterns. Similarly, the new test set was completed with SD-3 examples starting at pattern # 35,000 to make a full set with 60,000 test patterns. Only a subset of 10,000 test images (5,000 from SD-1 and 5,000 from SD-3) is available on this site. The full 60,000 sample training set is available.\n", + "\n", + "Downloaded from openml.org.\n" + ] + } + ], + "source": [ + "print(mnist.DESCR)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next we will preview one of the images and its corresponding label." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAOcAAADnCAYAAADl9EEgAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAGaElEQVR4nO3dPUiWfR/G8dveSyprs2gOXHqhcAh6hZqsNRqiJoPKRYnAoTGorWyLpqhFcmgpEmqIIByKXiAHIaKhFrGghiJ81ucBr991Z/Z4XPr5jB6cXSfVtxP6c2rb9PT0P0CeJfN9A8DMxAmhxAmhxAmhxAmhljXZ/Vcu/H1tM33RkxNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCLZvvG+B//fr1q9y/fPnyVz9/aGio4fb9+/fy2vHx8XK/ceNGuQ8MDDTc7t69W167atWqcr948WK5X7p0qdzngycnhBInhBInhBInhBInhBInhBInhHLOOYMPHz6U+48fP8r92bNn5f706dOG29TUVHnt8PBwuc+nLVu2lPv58+fLfWRkpOG2du3a8tpt27aV+759+8o9kScnhBInhBInhBInhBInhBInhGqbnp6u9nJsVS9evCj3gwcPlvvffm0r1dKlS8v91q1b5d7e3j7rz960aVO5b9iwody3bt0668/+P2ib6YuenBBKnBBKnBBKnBBKnBBKnBBKnBBqUZ5zTk5Olnt3d3e5T0xMzOXtzKlm997sPPDx48cNtxUrVpTXLtbz3zngnBNaiTghlDghlDghlDghlDghlDgh1KL81pgbN24s96tXr5b7/fv3y33Hjh3l3tfXV+6V7du3l/vo6Gi5N3un8s2bNw23a9euldcytzw5IZQ4IZQ4IZQ4IZQ4IZQ4IZQ4IdSifJ/zT339+rXcm/24ut7e3obbzZs3y2tv375d7idOnCh3InmfE1qJOCGUOCGUOCGUOCGUOCGUOCHUonyf80+tW7fuj65fv379rK9tdg56/Pjxcl+yxL/HrcKfFIQSJ4QSJ4QSJ4QSJ4QSJ4Tyytg8+PbtW8Otp6envPbJkyfl/uDBg3I/fPhwuTMvvDIGrUScEEqcEEqcEEqcEEqcEEqcEMo5Z5iJiYly37lzZ7l3dHSU+4EDB8p9165dDbezZ8+W17a1zXhcR3POOaGViBNCiRNCiRNCiRNCiRNCiRNCOedsMSMjI+V++vTpcm/24wsrly9fLveTJ0+We2dn56w/e4FzzgmtRJwQSpwQSpwQSpwQSpwQSpwQyjnnAvP69ety7+/vL/fR0dFZf/aZM2fKfXBwsNw3b948689ucc45oZWIE0KJE0KJE0KJE0KJE0KJE0I551xkpqamyv3+/fsNt1OnTpXXNvm79M+hQ4fK/dGjR+W+gDnnhFYiTgglTgglTgglTgglTgjlKIV/beXKleX+8+fPcl++fHm5P3z4sOG2f//+8toW5ygFWok4IZQ4IZQ4IZQ4IZQ4IZQ4IdSy+b4B5tarV6/KfXh4uNzHxsYabs3OMZvp6uoq97179/7Rr7/QeHJCKHFCKHFCKHFCKHFCKHFCKHFCKOecYcbHx8v9+vXr5X7v3r1y//Tp02/f07+1bFn916mzs7PclyzxrPhvfjcglDghlDghlDghlDghlDghlDghlHPOv6DZWeKdO3cabkNDQ+W179+/n80tzYndu3eX++DgYLkfPXp0Lm9nwfPkhFDihFDihFDihFDihFD
ihFCOUmbw+fPncn/79m25nzt3rtzfvXv32/c0V7q7u8v9woULDbdjx46V13rla2753YRQ4oRQ4oRQ4oRQ4oRQ4oRQ4oRQC/acc3JysuHW29tbXvvy5ctyn5iYmNU9zYU9e/aUe39/f7kfOXKk3FevXv3b98Tf4ckJocQJocQJocQJocQJocQJocQJoWLPOZ8/f17uV65cKfexsbGG28ePH2d1T3NlzZo1Dbe+vr7y2mbffrK9vX1W90QeT04IJU4IJU4IJU4IJU4IJU4IJU4IFXvOOTIy8kf7n+jq6ir3np6ecl+6dGm5DwwMNNw6OjrKa1k8PDkhlDghlDghlDghlDghlDghlDghVNv09HS1lyMwJ9pm+qInJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4Rq9iMAZ/yWfcDf58kJocQJocQJocQJocQJocQJof4DO14Dhyk10VwAAAAASUVORK5CYII=\n", + "text/plain": [ + "<Figure size 432x288 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib as mpl\n", + "import matplotlib.pyplot as plt\n", + "\n", + "#Assign the data and the targets\n", + "X, y = mnist[\"data\"], mnist[\"target\"]\n", + "\n", + "#Display the first image\n", + "some_digit = X[0]\n", + "some_digit_image = some_digit.reshape(28, 28)\n", + "plt.imshow(some_digit_image, cmap=\"binary\")\n", + "plt.axis(\"off\")\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'5'" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#The corresponding class label\n", + "y[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next we will partition the data into a training and testing sets. We are going to train a binary classifier that detects the number \"5\", so we need to find all of the 5s and mark them as \"true\" in the train and test sets." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "y = y.astype(np.uint8)\n", + "X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]\n", + "y_train_5 = (y_train == 5) # True for all 5s, False for all other digits\n", + "y_test_5 = (y_test == 5)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now it's time to train the classifier. We will train a stochastic gradient descent classifier.\n", + "\n", + "We will set the random state for reproducibility.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "SGDClassifier(alpha=0.0001, average=False, class_weight=None,\n", + " early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,\n", + " l1_ratio=0.15, learning_rate='optimal', loss='hinge',\n", + " max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',\n", + " power_t=0.5, random_state=42, shuffle=True, tol=0.001,\n", + " validation_fraction=0.1, verbose=0, warm_start=False)" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.linear_model import SGDClassifier\n", + "\n", + "sgd_clf = SGDClassifier(random_state=42)\n", + "sgd_clf.fit(X_train, y_train_5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can do a simple test to see if our classifier can classify an image from the training set. We will look at a full test later.\n", + "\n", + "The digit in `X[0]` represents a \"5\", hence a result of `True` is expected." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ True])" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sgd_clf.predict([X[0]])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/topic_17/Untitled.ipynb b/topic_17/Untitled.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..35eac67af2e6e9a9ec924d04c98b18eba5d2988d --- /dev/null +++ b/topic_17/Untitled.ipynb @@ -0,0 +1,32 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/topic_17/topic_17.md b/topic_17/topic_17.md new file mode 100644 index 0000000000000000000000000000000000000000..24f9546a7ee60e19febcba31c2b437d1f62ee1fe --- /dev/null +++ b/topic_17/topic_17.md @@ -0,0 +1,154 @@ +class: bottom, left +background-image: url(assets/g.png) + +<h2 class="title_headings_sml">COSC102 - Data Science Studio 1</h2> + +<h1 class="title_headings_sml"> Topic 17 - Classification using Machine Learning </h1> + +<h3 class="title_headings_sml"> Dr. 
Mitchell Welch </h3> + +--- + +## Reading + +* Chapter 3 from ***Hands-on Machine Learning with Scikit-Learn & TensorFlow*** + +--- +## Summary + +* Machine Learning & Classification +* MNIST Dataset +* Partitioning the Data Set +* Training the Classifier +--- + +## Machine Learning & Classification + +* Recall from the previous topics that the most common supervised learning tasks are: + * Regression (predicting values) + * Classification (predicting category classes) +* In machine learning, classification refers to a predictive modeling problem where a class label is predicted for a given example of input data. +* Examples: Activity recognition from accelerometer sensors, natural language processing, plant species classification. +--- + +## Machine Learning & Classification + +.center[] + + +--- +## Machine Learning & Classification + +* Classification requires a training dataset with **many** examples of inputs and outputs from which to learn. +* *Binary classification* refers to those classification tasks that have two class labels. +* *Multi-class classification* refers to those classification tasks that have more than two class labels. + +--- + +## MNIST Dataset + +* To continue our discussion on machine learning, we will introduce the *MNIST* dataset. +* The MNIST dataset contains 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. +* Each image is labeled with the digit it represents. +* The digit images (technically the pixels from the images) are the input features and the ground-truth labels are the output. +* We will train some classifier algorithms to predict the digit given the image. + + +--- + +## MNIST Dataset + +* A copy of the MNIST dataset can be downloaded from OpenML through Scikit-learn. +* Review the demonstration notebook to see how to load and preview the data. +* There are 70,000 images, and each image has 784 features. 
+* This is because each image is 28 × 28 pixels, and each feature simply represents one pixel’s intensity, from 0 (white) to 255 (black). + +--- +## MNIST Dataset + +.center[] + +--- + +## Partitioning the Data Set + +* Before we move on to train our classifier, we need to divide our data set into a *training* set and a *testing* set. +* The training set is used to train and tune the classifier. +* The test set is used to test the predictive ability of the classifier. +* None of the data items in the test set should be used as part of the training process. + +--- + +## Partitioning the Data Set + +* If we look at the MNIST dataset loaded through Scikit-Learn, we can see it has test/train partitions built in: + * The first 60000 images make up the training set. + * The last 10000 images make up the testing set. +* With this partitioning, every digit is well represented in both sets; the classes are roughly balanced, although the counts per digit are not exactly equal. + +```python +X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:] +``` + +--- + +## Training the Classifier + +* Now we can investigate some classifier algorithms. +* Let’s simplify the problem for now and only try to identify one digit, for example the number "5". +* This “5-detector” will be an example of a binary classifier, capable of distinguishing between just two classes, "5" and "not-5". + +--- +## Training the Classifier + +* We will train a linear classifier using the Stochastic Gradient Descent (SGD) algorithm. +* Strictly speaking, SGD is merely an optimization technique and does not correspond to a specific family of machine learning models. +* We will be using the default model: a linear *Support Vector Machine (SVM)*, which is what `SGDClassifier` fits with its default `hinge` loss. +* In a later topic we will cover the workings of the SVM. +--- + +## Training the Classifier + +* The Support Vector Machine (SVM) works by constructing a *hyperplane* - essentially a decision boundary that separates the classes within the data. 
+* This algorithm is very efficient and works well on data that is *linearly separable* (classes that can be separated by a straight line, or more generally a hyperplane, in the feature space). +* This classifier has the advantage of being capable of handling very large datasets efficiently. + +--- +## Training the Classifier + +.center[] + +--- +## Training the Classifier + +* Now it's time to actually train the classifier using the training set. + +```python +from sklearn.linear_model import SGDClassifier +sgd_clf = SGDClassifier(random_state=42) +sgd_clf.fit(X_train, y_train_5) + +# A quick test on our classifier +sgd_clf.predict([X[0]]) + +``` + +--- +## Summary + +* Machine Learning & Classification +* MNIST Dataset +* Partitioning the Data Set +* Training the Classifier + +--- + +## Reading + +* Chapter 3 from ***Hands-on Machine Learning with Scikit-Learn & TensorFlow*** + +--- + + + + diff --git a/topic_17/topic_17_classification_demo.ipynb b/topic_17/topic_17_classification_demo.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..0292ffa265337bd544f10f731d4e9c4f43726714 --- /dev/null +++ b/topic_17/topic_17_classification_demo.ipynb @@ -0,0 +1,222 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Topic 17 - Classification Demonstration\n", + "\n", + "In this demonstration we will train some classifier algorithms on the MNIST data set.\n", + "\n", + "We will start by loading the dataset from the Scikit-learn library.\n", + "\n", + "**Note: This may take a long time to execute - there are 70000 images.**\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.datasets import fetch_openml\n", + "mnist = fetch_openml('mnist_784', version=1)\n", + "mnist.keys()\n", + "# dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details',\n", + "# 'categories', 'url'])" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + 
"name": "stdout", + "output_type": "stream", + "text": [ + "**Author**: Yann LeCun, Corinna Cortes, Christopher J.C. Burges \n", + "**Source**: [MNIST Website](http://yann.lecun.com/exdb/mnist/) - Date unknown \n", + "**Please cite**: \n", + "\n", + "The MNIST database of handwritten digits with 784 features, raw data available at: http://yann.lecun.com/exdb/mnist/. It can be split in a training set of the first 60,000 examples, and a test set of 10,000 examples \n", + "\n", + "It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field. \n", + "\n", + "With some classification methods (particularly template-based methods, such as SVM and K-nearest neighbors), the error rate improves when the digits are centered by bounding box rather than center of mass. If you do this kind of pre-processing, you should report it in your publications. The MNIST database was constructed from NIST's NIST originally designated SD-3 as their training set and SD-1 as their test set. However, SD-3 is much cleaner and easier to recognize than SD-1. The reason for this can be found on the fact that SD-3 was collected among Census Bureau employees, while SD-1 was collected among high-school students. 
Drawing sensible conclusions from learning experiments requires that the result be independent of the choice of training set and test among the complete set of samples. Therefore it was necessary to build a new database by mixing NIST's datasets. \n", + "\n", + "The MNIST training set is composed of 30,000 patterns from SD-3 and 30,000 patterns from SD-1. Our test set was composed of 5,000 patterns from SD-3 and 5,000 patterns from SD-1. The 60,000 pattern training set contained examples from approximately 250 writers. We made sure that the sets of writers of the training set and test set were disjoint. SD-1 contains 58,527 digit images written by 500 different writers. In contrast to SD-3, where blocks of data from each writer appeared in sequence, the data in SD-1 is scrambled. Writer identities for SD-1 is available and we used this information to unscramble the writers. We then split SD-1 in two: characters written by the first 250 writers went into our new training set. The remaining 250 writers were placed in our test set. Thus we had two sets with nearly 30,000 examples each. The new training set was completed with enough examples from SD-3, starting at pattern # 0, to make a full set of 60,000 training patterns. Similarly, the new test set was completed with SD-3 examples starting at pattern # 35,000 to make a full set with 60,000 test patterns. Only a subset of 10,000 test images (5,000 from SD-1 and 5,000 from SD-3) is available on this site. The full 60,000 sample training set is available.\n", + "\n", + "Downloaded from openml.org.\n" + ] + } + ], + "source": [ + "print(mnist.DESCR)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next we will preview one of the images and its corresponding label." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAOcAAADnCAYAAADl9EEgAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAFPUlEQVR4nO3dPUiVbRzHcc+TREI2NDRGiAiBSxIO7dUgtAbq5qrQWGsvaGJULg0OQktTQXtuDr6Ak1NbNLRk5KKjTT0Qj/ffB092fsc+n7Ef1+lQfLmhi9taBwcHPUCefzr9BYDDiRNCiRNCiRNCiRNC9R6x+6dcOHmtw37RkxNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNC9Xb6C8BPnz59atz29/fLs2/evCn3V69eHes7/TQ2Nta4LS8vt/XZTTw5IZQ4IZQ4IZQ4IZQ4IZQ4IZQ4IZR7Tn6bDx8+lPu7d+/Kvbqr/P79e3m21WqVe7vW1tZO9PMP48kJocQJocQJocQJocQJocQJocQJodxz8oupqanGbXt7uzy7sbHxu7/Ovy5cuFDuExMT5X79+vVyHx8fL/dz586V+0nw5IRQ4oRQ4oRQ4oRQ4oRQ4oRQ4oRQrYODg2ovR/Ls7OyU+4MHD8p9aWmpcbt48WJ5dmBgoNzv379f7sPDw41bX19fefby5cvlHu7Ql1E9OSGUOCGUOCGUOCGUOCGUOCGUOCGUe85T5t69e+W+uLhY7jMzM43bkydPyrPnz58vdxq554RuIk4IJU4IJU4IJU4IJU4I5SqlA/b29hq3p0+flmdfv35d7i9fviz3I/6+e27fvt24deLHQ/4lXKVANxEnhBInhBInhBInhBInhBInhPJfAHbA48ePG7e5ubny7N27d8v91q1b5e6usnt4ckIocUIocUIocUIocUIocUIocUIo73N2QKt16Ot7/8v79+/L/c6dO8f+bDrG+5zQTcQJocQJocQJocQJocQJocQJobzP2QGjo6ON2+bmZnl2enq63Pv6+sr95s2b5U4OT04IJU4IJU4IJU4IJU4IJU4IJU4I5X3OQ6yvr5f7tWvXyv3s2bPl/u3bt8ZtcXGxPPvw4cNy7+/vL/e1tbVyv3r1arlzIrzPCd1EnBBKnBBKnBBKnBBKnBDq1F6lfPnypXEbGxsrz37+/Lncnz9/Xu6Tk5PlXvn69Wu5X7p06dif3dPT07O6ulruN27caOvzORZXKdBNxAmhxAmhxAmhxAmhxAmhxAmhTu2PxhwZGWncdnd3y7Pz8/Pl3s495lFevHjR1vmjfvTl8PBwW5/Pn+PJCaHECaHECaHECaHECaHECaHECaFO7fucs7OzjdujR4/Ks/v7+7/76/xiaGiocfv48WN59sqVK+X+9u3bcq/uf+kY73NCNxEnhBInhBInhBInhBInhBInhDq195yVhYWFct/a2ir3lZWVtn7/6s98dHS0PPvs2bNyHxwcLPczZ86UOx3hnhO6iTghlDghlDghlDghlDgh1F95lQJhXKVANxEnhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhB
InhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhBInhOo9Ym/9kW8B/IcnJ4QSJ4QSJ4QSJ4QSJ4QSJ4T6AZAorng0D88IAAAAAElFTkSuQmCC\n", + "text/plain": [ + "<Figure size 432x288 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib as mpl\n", + "import matplotlib.pyplot as plt\n", + "\n", + "#Assign the data and the targets\n", + "X, y = mnist[\"data\"], mnist[\"target\"]\n", + "\n", + "#Display the first image\n", + "some_digit = X[11]\n", + "some_digit_image = some_digit.reshape(28, 28)\n", + "plt.imshow(some_digit_image, cmap=\"binary\")\n", + "plt.axis(\"off\")\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'5'" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#The corresponding class label\n", + "y[11]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next we will partition the data into a training and testing sets. We are going to train a binary classifier that detects the number \"5\", so we need to find all of the 5s and mark them as \"true\" in the train and test sets." + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "y = y.astype(np.uint8)\n", + "X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]\n", + "y_train_5 = (y_train == 5) # True for all 5s, False for all other digits\n", + "y_test_5 = (y_test == 5)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now its time to train the classifier. 
We will train a stochastic gradient descent classifier.\n", + "\n", + "We will set the random state for reproducibility.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "SGDClassifier(alpha=0.0001, average=False, class_weight=None,\n", + " early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,\n", + " l1_ratio=0.15, learning_rate='optimal', loss='hinge',\n", + " max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',\n", + " power_t=0.5, random_state=42, shuffle=True, tol=0.001,\n", + " validation_fraction=0.1, verbose=0, warm_start=False)" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.linear_model import SGDClassifier\n", + "\n", + "sgd_clf = SGDClassifier(random_state=42)\n", + "sgd_clf.fit(X_train, y_train_5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can do a simple test to see if our classifier can classify an image from the training set. We will look at a full test later.\n", + "\n", + "The digit in `X[11]` represents a \"5\", hence a result of `True` is expected." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ True])" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sgd_clf.predict([X[11]])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/topic_18/.ipynb_checkpoints/Untitled-checkpoint.ipynb b/topic_18/.ipynb_checkpoints/Untitled-checkpoint.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..146acb87f0df56fbe0f239b95d44e36269ea3144 --- /dev/null +++ b/topic_18/.ipynb_checkpoints/Untitled-checkpoint.ipynb @@ -0,0 +1,67 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.datasets import fetch_openml\n", + "mnist = fetch_openml('mnist_784', version=1)\n", + "mnist.keys()\n", + "dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details',\n", + "'categories', 'url'])\n", + "\n", + "import matplotlib as mpl\n", + "import matplotlib.pyplot as plt\n", + "\n", + "X, y = mnist[\"data\"], mnist[\"target\"]\n", + "\n", + "some_digit = X[0]\n", + "some_digit_image = some_digit.reshape(28, 28)\n", + "plt.imshow(some_digit_image, cmap=\"binary\")\n", + "plt.axis(\"off\")\n", + "plt.show()\n", + "\n", + "import numpy as np\n", + "y = y.astype(np.uint8)\n", + "X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]\n", + "y_train_5 = (y_train == 5) # True for all 5s, False for all other digits\n", + "y_test_5 = (y_test == 5)\n", + "\n", + "from sklearn.linear_model import SGDClassifier\n", + "\n", + 
"sgd_clf = SGDClassifier(random_state=42)\n", + "sgd_clf.fit(X_train, y_train_5)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/topic_18/.ipynb_checkpoints/performance_demo-checkpoint.ipynb b/topic_18/.ipynb_checkpoints/performance_demo-checkpoint.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..143ea6e97e9cfb97731a7cfffd6ae228f7f877c5 --- /dev/null +++ b/topic_18/.ipynb_checkpoints/performance_demo-checkpoint.ipynb @@ -0,0 +1,100 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAOcAAADnCAYAAADl9EEgAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAGaElEQVR4nO3dPUiWfR/G8dveSyprs2gOXHqhcAh6hZqsNRqiJoPKRYnAoTGorWyLpqhFcmgpEmqIIByKXiAHIaKhFrGghiJ81ucBr991Z/Z4XPr5jB6cXSfVtxP6c2rb9PT0P0CeJfN9A8DMxAmhxAmhxAmhxAmhljXZ/Vcu/H1tM33RkxNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCLZvvG+B//fr1q9y/fPnyVz9/aGio4fb9+/fy2vHx8XK/ceNGuQ8MDDTc7t69W167atWqcr948WK5X7p0qdzngycnhBInhBInhBInhBInhBInhBInhHLOOYMPHz6U+48fP8r92bNn5f706dOG29TUVHnt8PBwuc+nLVu2lPv58+fLfWRkpOG2du3a8tpt27aV+759+8o9kScnhBInhBInhBInhBInhBInhGqbnp6u9nJsVS9evCj3gwcPlvvffm0r1dKlS8v91q1b5d7e3j7rz960aVO5b9iwody3bt0668/+P2ib6YuenBBKnBBKnBBKnBBKnBBKnBBKnBBqUZ5zTk5Olnt3d3e5T0xMzOXtzKlm997sPPDx48cNtxUrVpTXLtbz3zngnBNaiTghlDghlDghlDghlDghlDgh1KL81pgbN24s96tXr5b7/fv3y33Hjh3l3tfXV+6V7du3l/vo6Gi5N3un8s2bNw23a9euldcytzw5IZQ4IZQ4IZQ4IZQ4IZQ4IZQ4IdSifJ/zT339+rXcm/24ut7e3obbzZs3y2tv375d7idOnCh3InmfE1qJOCGUOCGUOCGUOCGUOCGUOCHUonyf80+tW7fuj65fv379rK9tdg56/Pjxcl+yxL/HrcKfFIQSJ4QSJ4QSJ4QSJ4QSJ4Tyytg8+PbtW8Otp6envPbJkyfl/uDBg3I/fPhwuTMvvDIGrUScEEqcEEqcEEqcEEqcEEqcEMo5Z5iJiYly37lzZ7l3dHSU+4EDB8p9165dDbezZ8+W17a1zXhcR3POOaGViBNCiRNCiRNCiRNCiRNCiRNCOedsMSMjI+V++vTpcm/24wsrly9fLveTJ0+We2dn56w/e4FzzgmtRJwQSpwQSpwQSpwQSpwQSpwQyjnnAvP69ety7+/vL/fR0dFZf/aZM2fKfXBwsNw3b948689ucc45oZWIE0KJE0KJE0KJE0KJE0KJE0I551xkpqamyv3+/fsNt1OnTpXXNvm79M+hQ4fK/dGjR+W+gDnnhFYiTgglTgglTgglTgglTgjlKIV/beXKleX+8+fPcl++fHm5P3z4sOG2f//+8toW5ygFWok4IZQ4IZQ4IZQ4IZQ4IZQ4IdSy+b4B5tarV6/KfXh4uNzHxsYabs3OMZvp6uoq97179/7Rr7/QeHJCKHFCKHFCKHFCKHFCKHFCKHFCKOecYcbHx8v9+vXr5X7v3r1y//Tp02/f07+1bFn916mzs7PclyzxrPhvfjcglDghlDghlDghlDghlDghlDghlHPOv6DZWeKdO3cabkNDQ+W179+/n80tzYndu3eX++DgYLkfPXp0Lm9nwfPkhFDihFDihFDihFDihFDihFCOUmbw+fPncn/79m25nzt3rtzfvXv32/c0V7q7u8v9woULDbdjx46V13rla2753YRQ4oRQ4oRQ4oRQ4oRQ4oRQ4oRQC/acc3JysuHW29tbXvvy5ctyn5iYmNU
9zYU9e/aUe39/f7kfOXKk3FevXv3b98Tf4ckJocQJocQJocQJocQJocQJocQJoWLPOZ8/f17uV65cKfexsbGG28ePH2d1T3NlzZo1Dbe+vr7y2mbffrK9vX1W90QeT04IJU4IJU4IJU4IJU4IJU4IJU4IFXvOOTIy8kf7n+jq6ir3np6ecl+6dGm5DwwMNNw6OjrKa1k8PDkhlDghlDghlDghlDghlDghlDghVNv09HS1lyMwJ9pm+qInJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4Rq9iMAZ/yWfcDf58kJocQJocQJocQJocQJocQJof4DO14Dhyk10VwAAAAASUVORK5CYII=\n", + "text/plain": [ + "<Figure size 432x288 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "SGDClassifier(alpha=0.0001, average=False, class_weight=None,\n", + " early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,\n", + " l1_ratio=0.15, learning_rate='optimal', loss='hinge',\n", + " max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',\n", + " power_t=0.5, random_state=42, shuffle=True, tol=0.001,\n", + " validation_fraction=0.1, verbose=0, warm_start=False)" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.datasets import fetch_openml\n", + "mnist = fetch_openml('mnist_784', version=1)\n", + "mnist.keys()\n", + "\n", + "import matplotlib as mpl\n", + "import matplotlib.pyplot as plt\n", + "\n", + "X, y = mnist[\"data\"], mnist[\"target\"]\n", + "\n", + "some_digit = X[0]\n", + "some_digit_image = some_digit.reshape(28, 28)\n", + "plt.imshow(some_digit_image, cmap=\"binary\")\n", + "plt.axis(\"off\")\n", + "plt.show()\n", + "\n", + "import numpy as np\n", + "y = y.astype(np.uint8)\n", + "X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]\n", + "y_train_5 = (y_train == 5) # True for all 5s, False for all other digits\n", + "y_test_5 = (y_test == 5)\n", + "\n", + "from sklearn.linear_model import SGDClassifier\n", + "\n", + "sgd_clf = SGDClassifier(random_state=42)\n", + "sgd_clf.fit(X_train, y_train_5)" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/topic_18/Untitled.ipynb b/topic_18/Untitled.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..146acb87f0df56fbe0f239b95d44e36269ea3144 --- /dev/null +++ b/topic_18/Untitled.ipynb @@ -0,0 +1,67 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.datasets import fetch_openml\n", + "mnist = fetch_openml('mnist_784', version=1)\n", + "mnist.keys()\n", + "dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details',\n", + "'categories', 'url'])\n", + "\n", + "import matplotlib as mpl\n", + "import matplotlib.pyplot as plt\n", + "\n", + "X, y = mnist[\"data\"], mnist[\"target\"]\n", + "\n", + "some_digit = X[0]\n", + "some_digit_image = some_digit.reshape(28, 28)\n", + "plt.imshow(some_digit_image, cmap=\"binary\")\n", + "plt.axis(\"off\")\n", + "plt.show()\n", + "\n", + "import numpy as np\n", + "y = y.astype(np.uint8)\n", + "X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]\n", + "y_train_5 = (y_train == 5) # True for all 5s, False for all other digits\n", + "y_test_5 = (y_test == 5)\n", + "\n", + "from sklearn.linear_model import SGDClassifier\n", + "\n", + "sgd_clf = SGDClassifier(random_state=42)\n", + "sgd_clf.fit(X_train, y_train_5)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + 
"language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/topic_18/topic_18.md b/topic_18/topic_18.md new file mode 100644 index 0000000000000000000000000000000000000000..23c1d80081068ad8b5d375b640d61d0144efdfd7 --- /dev/null +++ b/topic_18/topic_18.md @@ -0,0 +1,158 @@ +class: bottom, left +background-image: url(assets/g.png) + +<h2 class="title_headings_sml">COSC102 - Data Science Studio 1</h2> + +<h1 class="title_headings_sml"> Topic 18 - Classification Performance </h1> + +<h3 class="title_headings_sml"> Dr. Mitchell Welch </h3> + +--- + +## Reading + +* Chapter 3 from ***Hands-on Machine Learning with Scikit-Learn & TensorFlow*** + +--- +## Summary + +* Classifier Performance +* Confusion Matrix +* Precision/Recall +--- + +## Classifier Performance + +* Carrying on from the previous topic, we have: + * Trained a simple classifier. + * Used the classifier to predict individual values. +* Now we will complete a more comprehensive assessment of how well the classifier performs. +* To achieve this, we will be using the Scikit-learn built-in cross validation functions. + +--- + +## Classifier Performance + +* One of the easiest ways to assess the performance is with *K-Fold* cross-validation. +* The process involves splitting the dataset into *folds* (subsets). +* All but one fold is used to train the classifier, and the remaining fold is used to test the classifier. +* This process is repeated so that each fold is used exactly once as the test fold, and the performance measures are averaged across all of the folds.
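* The fold rotation described above can be sketched in plain Python. This is only an illustration: `k_fold_indices` is a hypothetical helper, not a Scikit-learn function, and it splits the indices in order without shuffling or stratification (Scikit-learn's cross validation functions use stratified folds for classifiers).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one pair per fold.

    Hypothetical helper for illustration only; it is not part of Scikit-learn.
    """
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                # the held-out fold
        train = indices[:start] + indices[start + size:]  # all remaining folds
        yield train, test
        start += size


# Each sample lands in the test fold exactly once across the k rounds.
for train_idx, test_idx in k_fold_indices(10, 3):
    print("train:", train_idx, "test:", test_idx)
```

* In each round a classifier would be trained on `train_idx` and scored on `test_idx`, and the k scores would then be averaged.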
+ + +--- + +## Classifier Performance + +.center[] + +--- + +## Classifier Performance + +* This is all built in to Scikit-learn: +* Our first try: + +```python +from sklearn.model_selection import cross_val_score +cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy") + +``` + +--- + +## Classifier Performance + +* At first glance this looks good, with 90+% accuracy. But this is hiding a very ineffective classifier. +* The problem is that the classifier has been trained and assessed on an unbalanced dataset. +* Only about 10% of the dataset contains '5's. +* A classifier that simply assigns everything to the 'not-5' class would already achieve about 90% accuracy. +* We need a better way to assess the performance. + +--- + +## Confusion Matrix + +* We will look at using a *Confusion Matrix* +* The general idea is to count the number of times instances of class A are +classified as class B. +* A confusion matrix records: + * True positives (TP) - positive samples correctly identified as positive + * True negatives (TN) - negative samples correctly identified as negative + * False positives (FP) - negative samples incorrectly identified as positive + * False negatives (FN) - positive samples incorrectly identified as negative + + +--- +## Confusion Matrix + +.center[] + +--- + +## Confusion Matrix + +* To compute the confusion matrix, you first need to have a set of predictions so that +they can be compared to the actual targets. +* You could make predictions on the test set, but let’s keep it untouched for now (remember that you want to use the test set only at the very end of your project). +* We will use `cross_val_predict( ... )` to conduct a k-fold cross validation and return the prediction made for each instance while it was in the held-out fold. +* Review the [demo notebook](). + +--- + +## Precision/Recall + +* Using the results from the confusion matrix, we can calculate some concise metrics. +* *Precision* - the accuracy of the positive predictions.
+ +$$ +precision = TP / (TP + FP) +$$ + +* *Recall* (Sometimes called *Sensitivity*) - The ratio of positive instances that are correctly detected by the classifier. + +$$ +recall = TP /(TP + FN) +$$ + +--- + +## Precision/Recall + +* Review the demo notebook to see this in action: + +```python +from sklearn.metrics import precision_score, recall_score +print(precision_score(y_train_5, y_train_pred)) # == 4096 / (4096 + 1522) +print(recall_score(y_train_5, y_train_pred)) # == 4096 / (4096 + 1325) + +``` + +--- + +## Precision/Recall + +* Precision and recall can be summarized by calculating the *F1* score: + +$$ +F1 = TP / (TP + ((FN+FP)/2)) +$$ + +* This produces a harmonic mean that is only high if *both* the recall and precision are high. + +--- +## Summary + +* So far we have: + * Trained a simple binary classifier (our '5' detector). + * Tested it with simple predictions. + * Assessed the performance using a confusion matrix. + * Calculated the precision, recall and F1 scores. + + +--- +## Summary + +* Classifier Performance +* Confusion Matrix +* Precision/Recall +--- \ No newline at end of file diff --git a/topic_18/topic_18_performance_demo.ipynb b/topic_18/topic_18_performance_demo.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..5781e71a1c0df34be8f8e3b9cfca218747952b35 --- /dev/null +++ b/topic_18/topic_18_performance_demo.ipynb @@ -0,0 +1,400 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Topic 18 - Classifier Performance\n", + "\n", + "In this demonstration we will continue on from where we started in topic 17. First we will train a classifier algorithm on the MNIST data set." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAOcAAADnCAYAAADl9EEgAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAGaElEQVR4nO3dPUiWfR/G8dveSyprs2gOXHqhcAh6hZqsNRqiJoPKRYnAoTGorWyLpqhFcmgpEmqIIByKXiAHIaKhFrGghiJ81ucBr991Z/Z4XPr5jB6cXSfVtxP6c2rb9PT0P0CeJfN9A8DMxAmhxAmhxAmhxAmhljXZ/Vcu/H1tM33RkxNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCLZvvG+B//fr1q9y/fPnyVz9/aGio4fb9+/fy2vHx8XK/ceNGuQ8MDDTc7t69W167atWqcr948WK5X7p0qdzngycnhBInhBInhBInhBInhBInhBInhHLOOYMPHz6U+48fP8r92bNn5f706dOG29TUVHnt8PBwuc+nLVu2lPv58+fLfWRkpOG2du3a8tpt27aV+759+8o9kScnhBInhBInhBInhBInhBInhGqbnp6u9nJsVS9evCj3gwcPlvvffm0r1dKlS8v91q1b5d7e3j7rz960aVO5b9iwody3bt0668/+P2ib6YuenBBKnBBKnBBKnBBKnBBKnBBKnBBqUZ5zTk5Olnt3d3e5T0xMzOXtzKlm997sPPDx48cNtxUrVpTXLtbz3zngnBNaiTghlDghlDghlDghlDghlDgh1KL81pgbN24s96tXr5b7/fv3y33Hjh3l3tfXV+6V7du3l/vo6Gi5N3un8s2bNw23a9euldcytzw5IZQ4IZQ4IZQ4IZQ4IZQ4IZQ4IdSifJ/zT339+rXcm/24ut7e3obbzZs3y2tv375d7idOnCh3InmfE1qJOCGUOCGUOCGUOCGUOCGUOCHUonyf80+tW7fuj65fv379rK9tdg56/Pjxcl+yxL/HrcKfFIQSJ4QSJ4QSJ4QSJ4QSJ4Tyytg8+PbtW8Otp6envPbJkyfl/uDBg3I/fPhwuTMvvDIGrUScEEqcEEqcEEqcEEqcEEqcEMo5Z5iJiYly37lzZ7l3dHSU+4EDB8p9165dDbezZ8+W17a1zXhcR3POOaGViBNCiRNCiRNCiRNCiRNCiRNCOedsMSMjI+V++vTpcm/24wsrly9fLveTJ0+We2dn56w/e4FzzgmtRJwQSpwQSpwQSpwQSpwQSpwQyjnnAvP69ety7+/vL/fR0dFZf/aZM2fKfXBwsNw3b948689ucc45oZWIE0KJE0KJE0KJE0KJE0KJE0I551xkpqamyv3+/fsNt1OnTpXXNvm79M+hQ4fK/dGjR+W+gDnnhFYiTgglTgglTgglTgglTgjlKIV/beXKleX+8+fPcl++fHm5P3z4sOG2f//+8toW5ygFWok4IZQ4IZQ4IZQ4IZQ4IZQ4IdSy+b4B5tarV6/KfXh4uNzHxsYabs3OMZvp6uoq97179/7Rr7/QeHJCKHFCKHFCKHFCKHFCKHFCKHFCKOecYcbHx8v9+vXr5X7v3r1y//Tp02/f07+1bFn916mzs7PclyzxrPhvfjcglDghlDghlDghlDghlDghlDghlHPOv6DZWeKdO3cabkNDQ+W179+/n80tzYndu3eX++DgYLkfPXp0Lm9nwfPkhFDihFDihFDihFDihFD
ihFCOUmbw+fPncn/79m25nzt3rtzfvXv32/c0V7q7u8v9woULDbdjx46V13rla2753YRQ4oRQ4oRQ4oRQ4oRQ4oRQ4oRQC/acc3JysuHW29tbXvvy5ctyn5iYmNU9zYU9e/aUe39/f7kfOXKk3FevXv3b98Tf4ckJocQJocQJocQJocQJocQJocQJoWLPOZ8/f17uV65cKfexsbGG28ePH2d1T3NlzZo1Dbe+vr7y2mbffrK9vX1W90QeT04IJU4IJU4IJU4IJU4IJU4IJU4IFXvOOTIy8kf7n+jq6ir3np6ecl+6dGm5DwwMNNw6OjrKa1k8PDkhlDghlDghlDghlDghlDghlDghVNv09HS1lyMwJ9pm+qInJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4Rq9iMAZ/yWfcDf58kJocQJocQJocQJocQJocQJof4DO14Dhyk10VwAAAAASUVORK5CYII=\n", + "text/plain": [ + "<Figure size 432x288 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "SGDClassifier(alpha=0.0001, average=False, class_weight=None,\n", + " early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,\n", + " l1_ratio=0.15, learning_rate='optimal', loss='hinge',\n", + " max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',\n", + " power_t=0.5, random_state=42, shuffle=True, tol=0.001,\n", + " validation_fraction=0.1, verbose=0, warm_start=False)" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.datasets import fetch_openml\n", + "mnist = fetch_openml('mnist_784', version=1)\n", + "mnist.keys()\n", + "\n", + "import matplotlib as mpl\n", + "import matplotlib.pyplot as plt\n", + "\n", + "X, y = mnist[\"data\"], mnist[\"target\"]\n", + "\n", + "some_digit = X[0]\n", + "some_digit_image = some_digit.reshape(28, 28)\n", + "plt.imshow(some_digit_image, cmap=\"binary\")\n", + "plt.axis(\"off\")\n", + "plt.show()\n", + "\n", + "import numpy as np\n", + "y = y.astype(np.uint8)\n", + "X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]\n", + "y_train_5 = (y_train == 5) # True for all 5s, False for all other digits\n", + "y_test_5 = (y_test == 5)\n", + "\n", + "from sklearn.linear_model import SGDClassifier\n", + "\n", + 
"sgd_clf = SGDClassifier(random_state=42)\n", + "sgd_clf.fit(X_train, y_train_5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here we will run a 3-fold cross validation on the classifier that we have trained using the training set. " + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0.95035, 0.96035, 0.9604 ])" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.model_selection import cross_val_score\n", + "cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring=\"accuracy\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "At first glance this looks good, with 90+% accuracy. However, this is hiding a very ineffective classifier. The problem is that the classifier has been trained and assessed on an unbalanced dataset. \n", + "\n", + "**Only about 10% of the dataset contains '5's.**\n", + "\n", + "Let's construct a confusion matrix to see what is really happening. We will start by using `cross_val_predict( ... )` to test the classifier. This will return a list of the predictions made by the classifier.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import cross_val_predict\n", + "y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can calculate a confusion matrix using the `confusion_matrix( ... )` function."
+ ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[53892, 687],\n", + " [ 1891, 3530]], dtype=int64)" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.metrics import confusion_matrix\n", + "confusion_matrix(y_train_5, y_train_pred)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the confusion matrix, it is evident that there are actually many misclassifications.\n", + "\n", + "Using the results from the confusion matrix, we can calculate some more concise metrics.\n", + "* *Precision* - the accuracy of the positive predictions.\n", + "\n", + "$$\n", + "precision = TP / (TP + FP)\n", + "$$\n", + "\n", + "* *Recall* (sometimes called *Sensitivity*) - the ratio of positive instances that are correctly detected by the classifier. \n", + "\n", + "$$\n", + "recall = TP /(TP + FN)\n", + "$$" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.8370879772350012\n", + "0.6511713705958311\n" + ] + } + ], + "source": [ + "from sklearn.metrics import precision_score, recall_score\n", + "print(precision_score(y_train_5, y_train_pred)) # == 3530 / (3530 + 687)\n", + "print(recall_score(y_train_5, y_train_pred)) # == 3530 / (3530 + 1891)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When the classifier claims a value is a 5, it is only correct about 84% of the time (Precision).\n", + "The classifier only detects about 65% of the 5s in the dataset (Recall).\n", + "\n", + "Precision and recall can be summarized by calculating the *F1* score:\n", + "\n", + "$$\n", + "F1 = TP / (TP + ((FN+FP)/2))\n", + "$$" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.7325171197343846" + ] + }, + 
"execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.metrics import f1_score\n", + "f1_score(y_train_5, y_train_pred)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The F1 score summarises the recall and precision. This produces a harmonic mean that is only high if *both* the recall and precision are high. " + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,\n", + "method=\"decision_function\")\n", + "\n", + "from sklearn.metrics import precision_recall_curve\n", + "precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "<Figure size 576x288 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):\n", + " plt.plot(thresholds, precisions[:-1], \"b--\", label=\"Precision\", linewidth=2)\n", + " plt.plot(thresholds, recalls[:-1], \"g-\", label=\"Recall\", linewidth=2)\n", + " plt.xlabel(\"Threshold\", fontsize=16)\n", + " plt.legend(loc=\"upper left\", fontsize=16)\n", + " plt.ylim([0, 1])\n", + "\n", + "plt.figure(figsize=(8, 4))\n", + "plot_precision_recall_vs_threshold(precisions, recalls, thresholds)\n", + "plt.xlim([-100000, 100000])\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "<Figure size 576x432 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "def plot_precision_vs_recall(precisions, recalls):\n", + " plt.plot(recalls, 
precisions, \"b-\", linewidth=2)\n", + " plt.xlabel(\"Recall\", fontsize=16)\n", + " plt.ylabel(\"Precision\", fontsize=16)\n", + " plt.axis([0, 1, 0, 1])\n", + "\n", + "plt.figure(figsize=(8, 6))\n", + "plot_precision_vs_recall(precisions, recalls)\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.metrics import roc_curve\n", + "\n", + "fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "<Figure size 576x432 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "def plot_roc_curve(fpr, tpr, label=None):\n", + " plt.plot(fpr, tpr, linewidth=2, label=label)\n", + " plt.plot([0, 1], [0, 1], 'k--')\n", + " plt.axis([0, 1, 0, 1])\n", + " plt.xlabel('False Positive Rate', fontsize=16)\n", + " plt.ylabel('True Positive Rate', fontsize=16)\n", + "\n", + "plt.figure(figsize=(8, 6))\n", + "plot_roc_curve(fpr, tpr)\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.9604938554008616" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.metrics import roc_auc_score\n", + "roc_auc_score(y_train_5, y_scores)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + } 
+ }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/topic_19/.ipynb_checkpoints/Untitled-checkpoint.ipynb b/topic_19/.ipynb_checkpoints/Untitled-checkpoint.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..7fec51502cbc3200b3d0ffc6bbba1fe85e197f3d --- /dev/null +++ b/topic_19/.ipynb_checkpoints/Untitled-checkpoint.ipynb @@ -0,0 +1,6 @@ +{ + "cells": [], + "metadata": {}, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/topic_19/.ipynb_checkpoints/roc_demo-checkpoint.ipynb b/topic_19/.ipynb_checkpoints/roc_demo-checkpoint.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..c108758f94a1a9f26651e0068219422e0c0dd62d --- /dev/null +++ b/topic_19/.ipynb_checkpoints/roc_demo-checkpoint.ipynb @@ -0,0 +1,287 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Topic 19 - ROC Performance Demo\n", + "\n", + "In this demonstration we will continue on from where we started in topic 17. First we will train a classifier algorithm on the MNIST data set. Then we will calculate our performance metrics and analyse these using a ROC curve."
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAOcAAADnCAYAAADl9EEgAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAGaElEQVR4nO3dPUiWfR/G8dveSyprs2gOXHqhcAh6hZqsNRqiJoPKRYnAoTGorWyLpqhFcmgpEmqIIByKXiAHIaKhFrGghiJ81ucBr991Z/Z4XPr5jB6cXSfVtxP6c2rb9PT0P0CeJfN9A8DMxAmhxAmhxAmhxAmhljXZ/Vcu/H1tM33RkxNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCLZvvG+B//fr1q9y/fPnyVz9/aGio4fb9+/fy2vHx8XK/ceNGuQ8MDDTc7t69W167atWqcr948WK5X7p0qdzngycnhBInhBInhBInhBInhBInhBInhHLOOYMPHz6U+48fP8r92bNn5f706dOG29TUVHnt8PBwuc+nLVu2lPv58+fLfWRkpOG2du3a8tpt27aV+759+8o9kScnhBInhBInhBInhBInhBInhGqbnp6u9nJsVS9evCj3gwcPlvvffm0r1dKlS8v91q1b5d7e3j7rz960aVO5b9iwody3bt0668/+P2ib6YuenBBKnBBKnBBKnBBKnBBKnBBKnBBqUZ5zTk5Olnt3d3e5T0xMzOXtzKlm997sPPDx48cNtxUrVpTXLtbz3zngnBNaiTghlDghlDghlDghlDghlDgh1KL81pgbN24s96tXr5b7/fv3y33Hjh3l3tfXV+6V7du3l/vo6Gi5N3un8s2bNw23a9euldcytzw5IZQ4IZQ4IZQ4IZQ4IZQ4IZQ4IdSifJ/zT339+rXcm/24ut7e3obbzZs3y2tv375d7idOnCh3InmfE1qJOCGUOCGUOCGUOCGUOCGUOCHUonyf80+tW7fuj65fv379rK9tdg56/Pjxcl+yxL/HrcKfFIQSJ4QSJ4QSJ4QSJ4QSJ4Tyytg8+PbtW8Otp6envPbJkyfl/uDBg3I/fPhwuTMvvDIGrUScEEqcEEqcEEqcEEqcEEqcEMo5Z5iJiYly37lzZ7l3dHSU+4EDB8p9165dDbezZ8+W17a1zXhcR3POOaGViBNCiRNCiRNCiRNCiRNCiRNCOedsMSMjI+V++vTpcm/24wsrly9fLveTJ0+We2dn56w/e4FzzgmtRJwQSpwQSpwQSpwQSpwQSpwQyjnnAvP69ety7+/vL/fR0dFZf/aZM2fKfXBwsNw3b948689ucc45oZWIE0KJE0KJE0KJE0KJE0KJE0I551xkpqamyv3+/fsNt1OnTpXXNvm79M+hQ4fK/dGjR+W+gDnnhFYiTgglTgglTgglTgglTgjlKIV/beXKleX+8+fPcl++fHm5P3z4sOG2f//+8toW5ygFWok4IZQ4IZQ4IZQ4IZQ4IZQ4IdSy+b4B5tarV6/KfXh4uNzHxsYabs3OMZvp6uoq97179/7Rr7/QeHJCKHFCKHFCKHFCKHFCKHFCKHFCKOecYcbHx8v9+vXr5X7v3r1y//Tp02/f07+1bFn916mzs7PclyzxrPhvfjcglDghlDghlDghlDghlDghlDghlHPOv6DZWeKdO3cabkNDQ+W179+/n80tzYndu3eX++DgYLkfPXp0Lm9nwfPkhFDihFDihFDihFDihFD
ihFCOUmbw+fPncn/79m25nzt3rtzfvXv32/c0V7q7u8v9woULDbdjx46V13rla2753YRQ4oRQ4oRQ4oRQ4oRQ4oRQ4oRQC/acc3JysuHW29tbXvvy5ctyn5iYmNU9zYU9e/aUe39/f7kfOXKk3FevXv3b98Tf4ckJocQJocQJocQJocQJocQJocQJoWLPOZ8/f17uV65cKfexsbGG28ePH2d1T3NlzZo1Dbe+vr7y2mbffrK9vX1W90QeT04IJU4IJU4IJU4IJU4IJU4IJU4IFXvOOTIy8kf7n+jq6ir3np6ecl+6dGm5DwwMNNw6OjrKa1k8PDkhlDghlDghlDghlDghlDghlDghVNv09HS1lyMwJ9pm+qInJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4Rq9iMAZ/yWfcDf58kJocQJocQJocQJocQJocQJof4DO14Dhyk10VwAAAAASUVORK5CYII=\n", + "text/plain": [ + "<Figure size 432x288 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Precision: 0.8370879772350012\n", + "Recall: 0.6511713705958311\n", + "F1 Score: 0.7325171197343846\n" + ] + } + ], + "source": [ + "from sklearn.datasets import fetch_openml\n", + "mnist = fetch_openml('mnist_784', version=1)\n", + "mnist.keys()\n", + "\n", + "import matplotlib as mpl\n", + "import matplotlib.pyplot as plt\n", + "\n", + "X, y = mnist[\"data\"], mnist[\"target\"]\n", + "\n", + "some_digit = X[0]\n", + "some_digit_image = some_digit.reshape(28, 28)\n", + "plt.imshow(some_digit_image, cmap=\"binary\")\n", + "plt.axis(\"off\")\n", + "plt.show()\n", + "\n", + "import numpy as np\n", + "y = y.astype(np.uint8)\n", + "X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]\n", + "y_train_5 = (y_train == 5) # True for all 5s, False for all other digits\n", + "y_test_5 = (y_test == 5)\n", + "\n", + "from sklearn.linear_model import SGDClassifier\n", + "\n", + "sgd_clf = SGDClassifier(random_state=42)\n", + "sgd_clf.fit(X_train, y_train_5)\n", + "\n", + "from sklearn.model_selection import cross_val_score\n", + "cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring=\"accuracy\")\n", + "\n", + "from sklearn.model_selection import cross_val_predict\n", + "y_train_pred = 
cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)\n", + "\n", + "from sklearn.metrics import confusion_matrix\n", + "confusion_matrix(y_train_5, y_train_pred)\n", + "\n", + "from sklearn.metrics import precision_score, recall_score\n", + "print(\"Precision:\",precision_score(y_train_5, y_train_pred)) # == 3530 / (3530 + 687)\n", + "print(\"Recall:\", recall_score(y_train_5, y_train_pred)) # == 3530 / (3530 + 1891)\n", + "\n", + "from sklearn.metrics import f1_score\n", + "print(\"F1 Score:\",f1_score(y_train_5, y_train_pred))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can calculate the scores used to analyse the recall-precision trade-off. This is done by setting the `method` argument of `cross_val_predict( ... )` to \"decision_function\". The scores can then be used to calculate the precision and recall for the range of thresholds used by the classifier." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,\n", + "method=\"decision_function\")\n", + "\n", + "from sklearn.metrics import precision_recall_curve\n", + "precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can plot the recalls and precisions against threshold values to understand the nature of the trade-off. We will define a function that will plot the two data series. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "<Figure size 576x288 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):\n", + " plt.plot(thresholds, precisions[:-1], \"b--\", label=\"Precision\", linewidth=2)\n", + " plt.plot(thresholds, recalls[:-1], \"g-\", label=\"Recall\", linewidth=2)\n", + " plt.xlabel(\"Threshold\", fontsize=16)\n", + " plt.legend(loc=\"upper left\", fontsize=16)\n", + " plt.ylim([0, 1])\n", + "\n", + "plt.figure(figsize=(8, 4))\n", + "plot_precision_recall_vs_threshold(precisions, recalls, thresholds)\n", + "plt.xlim([-100000, 100000])\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Likewise, we can plot the precision against the recall. The area under this curve indicates the performance. A good classifier will have a curve that sits in the top-right corner.\n", + "\n", + "We have defined a function to plot the curve." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "<Figure size 576x432 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "def plot_precision_vs_recall(precisions, recalls):\n", + " plt.plot(recalls, precisions, \"b-\", linewidth=2)\n", + " plt.xlabel(\"Recall\", fontsize=16)\n", + " plt.ylabel(\"Precision\", fontsize=16)\n", + " plt.axis([0, 1, 0, 1])\n", + "\n", + "plt.figure(figsize=(8, 6))\n", + "plot_precision_vs_recall(precisions, recalls)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally we will plot the ROC curve. 
\n", + "\n", + "The ROC curve plots the *true positive rate* (another name for recall) against the *false positive rate (FPR)* \n", + "\n", + "Once again there is a trade-off: the higher the recall (TPR), the more false positives \n", + "(FPR) the classifier produces for the set of thresholds.\n", + "\n", + "We will define a function that plots the ROC curve." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "<Figure size 576x432 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "from sklearn.metrics import roc_curve\n", + "# We need to calculate the fpr and tpr, and retrieve the thresholds\n", + "fpr, tpr, thresholds = roc_curve(y_train_5, y_scores) \n", + "\n", + "def plot_roc_curve(fpr, tpr, label=None):\n", + " plt.plot(fpr, tpr, linewidth=2, label=label)\n", + " plt.plot([0, 1], [0, 1], 'k--')\n", + " plt.axis([0, 1, 0, 1])\n", + " plt.xlabel('False Positive Rate', fontsize=16)\n", + " plt.ylabel('True Positive Rate', fontsize=16)\n", + "\n", + "plt.figure(figsize=(8, 6))\n", + "plot_roc_curve(fpr, tpr)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can calculate the area under the curve (AUC) value using the Scikit-learn `roc_auc_score( ... )` function."
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.9604938554008616" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.metrics import roc_auc_score\n", + "roc_auc_score(y_train_5, y_scores)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/topic_19/Untitled.ipynb b/topic_19/Untitled.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..7fec51502cbc3200b3d0ffc6bbba1fe85e197f3d --- /dev/null +++ b/topic_19/Untitled.ipynb @@ -0,0 +1,6 @@ +{ + "cells": [], + "metadata": {}, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/topic_19/topic_19.md b/topic_19/topic_19.md new file mode 100644 index 0000000000000000000000000000000000000000..cca4ff4b560dc750d1e733618590b07613b56f22 --- /dev/null +++ b/topic_19/topic_19.md @@ -0,0 +1,136 @@ +class: bottom, left +background-image: url(assets/g.png) + +<h2 class="title_headings_sml">COSC102 - Data Science Studio 1</h2> + +<h1 class="title_headings_sml"> Topic 19 - The Receiver Operating Characteristic Curve </h1> + +<h3 class="title_headings_sml"> Dr. 
Mitchell Welch </h3> + +--- + +## Reading + +* Chapter 3 from ***Hands-on Machine Learning with Scikit-Learn & TensorFlow*** + +--- + +## Summary + +* Sensitivity-Precision Trade-off +* The ROC Curve + +--- + +## Sensitivity-Precision Trade-off + +* Using the results from the confusion matrix, we can calculate some concise metrics. +* *Precision* - the accuracy of the positive predictions. + +$$ +precision = TP / (TP + FP) +$$ + +* *Recall* (Sometimes called *Sensitivity*) - The ratio of positive instances that are correctly detected by the classifier. + +$$ +recall = TP /(TP + FN) +$$ + + +--- +## Sensitivity-Precision Trade-off + +* When building a classifier, there is always a level of trade-off between the *recall* and *precision*. +* Unfortunately, when working with classification problems, you can’t have it both ways: increasing precision reduces recall, and vice versa. + +--- + +## Sensitivity-Precision Trade-off + +* Our KNN example from topic 1: Remember the coloured regions represent the decision boundaries. +* We can move the decision boundaries to achieve higher/lower recall, but this is done at the cost of precision. +.center[] + + +--- + +## Sensitivity-Precision Trade-off + +* In a perfect world we are aiming to train a classifier that classifies the instances in a way such that this trade-off is minimized. +* In the previous example, this means learning decision boundaries that are detailed enough to achieve a high sensitivity, while not scooping up lots of false positives that drive precision down. +* Within the inner workings of most classifier algorithms, the decision to assign an instance to a class is made based on assigning a *score*. +* This score is compared to an arbitrary *threshold* value that is used to decide the classification. + +--- + +## Sensitivity-Precision Trade-off + +* This threshold effectively represents the location of the decision boundary. 
+* So by adjusting the threshold, you can manipulate the trade-off between recall and precision and assess the performance of the classifier. +* We can access the scores that the classifier assigns in the cross-validation process and plot them to understand the trade-off. + +```python +y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function") +from sklearn.metrics import precision_recall_curve +precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores) +``` +--- + +## The ROC Curve + +* The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. +* The ROC curve plots the *true positive rate* (TPR, another name for recall) against the *false positive rate* (FPR). +* Once again there is a trade-off: the higher the recall (TPR), the more false positives +(higher FPR) the classifier produces. +* [Run the demo to produce the ROC curve for our data](). + +--- + +## The ROC Curve + +* Interpreting a ROC curve is fairly straightforward: +* The dotted line represents the ROC curve of a purely random classifier. +* A good classifier stays as far away from that line as possible (toward the top-left corner). +* We assess this by calculating the *area under the curve (AUC)*. + +```python +from sklearn.metrics import roc_auc_score +roc_auc_score(y_train_5, y_scores) +``` + +--- + +## The ROC Curve + +* We have looked at two curves for assessing performance, PR and ROC: + * As a rule of thumb, you should prefer the PR curve whenever the positive class is rare or when you care more about the false positives than the false negatives. + +* For example, looking at the previous ROC curve (and the ROC AUC score), you +may think that the classifier is really good. + * But this is mostly because there are few positives (5s) compared to the negatives (non-5s). + +* The PR curve makes it clear that there is room for improvement. 
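To make the ROC-versus-PR point concrete, here is a minimal sketch in plain Python. It does not use the MNIST data or Scikit-learn: the function name `rates_at_threshold` and the toy labels and scores are assumptions made up purely for illustration, with a deliberately rare positive class.

```python
def rates_at_threshold(y_true, y_scores, threshold):
    """Return (precision, recall, fpr) when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, y_scores):
        predicted_positive = score >= threshold
        if predicted_positive and label:
            tp += 1
        elif predicted_positive and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    # Guard against empty denominators so the function never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, recall, fpr


# Hypothetical toy data: 2 positives among 20 instances (a rare positive class).
y_true = [1, 1] + [0] * 18
y_scores = [0.9, 0.6] + [0.7, 0.65] + [0.1] * 16

precision, recall, fpr = rates_at_threshold(y_true, y_scores, threshold=0.5)
print(f"precision={precision:.2f} recall={recall:.2f} fpr={fpr:.2f}")
# prints: precision=0.50 recall=1.00 fpr=0.11
```

On this toy data the ROC point (FPR of about 0.11, TPR of 1.0) looks excellent, yet only half of the positive predictions are correct. Because negatives vastly outnumber positives, a couple of false positives barely move the FPR but halve the precision, which is why the PR curve is the more revealing view when the positive class is rare.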
+--- + +## The ROC Curve + +* Takeaway points: + * Don't rely on accuracy alone as a measure of performance - especially with skewed datasets. + * The trade-off between recall and precision determines the effectiveness of the classifier. + * PR and ROC curves provide a nice way to visualize this. + +--- + +## Summary + +* Sensitivity-Precision Trade-off +* The ROC Curve +* ROC Implementation + +--- +## Reading + +* Chapter 3 from ***Hands-on Machine Learning with Scikit-Learn & TensorFlow*** + +--- \ No newline at end of file diff --git a/topic_19/topic_19_roc_demo.ipynb b/topic_19/topic_19_roc_demo.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..7577e76aaf614eda1b3e28b0d2b47da8bc99502c --- /dev/null +++ b/topic_19/topic_19_roc_demo.ipynb @@ -0,0 +1,287 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Topic 19 - ROC Performance Demo\n", + "\n", + "In this demonstration we will continue on from where we started in topic 17. First we will train a classifier algorithm on the MNIST data set. Then we will calculate our performance metrics and analyse these using a ROC curve."
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAOcAAADnCAYAAADl9EEgAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAGaElEQVR4nO3dPUiWfR/G8dveSyprs2gOXHqhcAh6hZqsNRqiJoPKRYnAoTGorWyLpqhFcmgpEmqIIByKXiAHIaKhFrGghiJ81ucBr991Z/Z4XPr5jB6cXSfVtxP6c2rb9PT0P0CeJfN9A8DMxAmhxAmhxAmhxAmhljXZ/Vcu/H1tM33RkxNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCLZvvG+B//fr1q9y/fPnyVz9/aGio4fb9+/fy2vHx8XK/ceNGuQ8MDDTc7t69W167atWqcr948WK5X7p0qdzngycnhBInhBInhBInhBInhBInhBInhHLOOYMPHz6U+48fP8r92bNn5f706dOG29TUVHnt8PBwuc+nLVu2lPv58+fLfWRkpOG2du3a8tpt27aV+759+8o9kScnhBInhBInhBInhBInhBInhGqbnp6u9nJsVS9evCj3gwcPlvvffm0r1dKlS8v91q1b5d7e3j7rz960aVO5b9iwody3bt0668/+P2ib6YuenBBKnBBKnBBKnBBKnBBKnBBKnBBqUZ5zTk5Olnt3d3e5T0xMzOXtzKlm997sPPDx48cNtxUrVpTXLtbz3zngnBNaiTghlDghlDghlDghlDghlDgh1KL81pgbN24s96tXr5b7/fv3y33Hjh3l3tfXV+6V7du3l/vo6Gi5N3un8s2bNw23a9euldcytzw5IZQ4IZQ4IZQ4IZQ4IZQ4IZQ4IdSifJ/zT339+rXcm/24ut7e3obbzZs3y2tv375d7idOnCh3InmfE1qJOCGUOCGUOCGUOCGUOCGUOCHUonyf80+tW7fuj65fv379rK9tdg56/Pjxcl+yxL/HrcKfFIQSJ4QSJ4QSJ4QSJ4QSJ4Tyytg8+PbtW8Otp6envPbJkyfl/uDBg3I/fPhwuTMvvDIGrUScEEqcEEqcEEqcEEqcEEqcEMo5Z5iJiYly37lzZ7l3dHSU+4EDB8p9165dDbezZ8+W17a1zXhcR3POOaGViBNCiRNCiRNCiRNCiRNCiRNCOedsMSMjI+V++vTpcm/24wsrly9fLveTJ0+We2dn56w/e4FzzgmtRJwQSpwQSpwQSpwQSpwQSpwQyjnnAvP69ety7+/vL/fR0dFZf/aZM2fKfXBwsNw3b948689ucc45oZWIE0KJE0KJE0KJE0KJE0KJE0I551xkpqamyv3+/fsNt1OnTpXXNvm79M+hQ4fK/dGjR+W+gDnnhFYiTgglTgglTgglTgglTgjlKIV/beXKleX+8+fPcl++fHm5P3z4sOG2f//+8toW5ygFWok4IZQ4IZQ4IZQ4IZQ4IZQ4IdSy+b4B5tarV6/KfXh4uNzHxsYabs3OMZvp6uoq97179/7Rr7/QeHJCKHFCKHFCKHFCKHFCKHFCKHFCKOecYcbHx8v9+vXr5X7v3r1y//Tp02/f07+1bFn916mzs7PclyzxrPhvfjcglDghlDghlDghlDghlDghlDghlHPOv6DZWeKdO3cabkNDQ+W179+/n80tzYndu3eX++DgYLkfPXp0Lm9nwfPkhFDihFDihFDihFDihFD
ihFCOUmbw+fPncn/79m25nzt3rtzfvXv32/c0V7q7u8v9woULDbdjx46V13rla2753YRQ4oRQ4oRQ4oRQ4oRQ4oRQ4oRQC/acc3JysuHW29tbXvvy5ctyn5iYmNU9zYU9e/aUe39/f7kfOXKk3FevXv3b98Tf4ckJocQJocQJocQJocQJocQJocQJoWLPOZ8/f17uV65cKfexsbGG28ePH2d1T3NlzZo1Dbe+vr7y2mbffrK9vX1W90QeT04IJU4IJU4IJU4IJU4IJU4IJU4IFXvOOTIy8kf7n+jq6ir3np6ecl+6dGm5DwwMNNw6OjrKa1k8PDkhlDghlDghlDghlDghlDghlDghVNv09HS1lyMwJ9pm+qInJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4Rq9iMAZ/yWfcDf58kJocQJocQJocQJocQJocQJof4DO14Dhyk10VwAAAAASUVORK5CYII=\n", + "text/plain": [ + "<Figure size 432x288 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Precision: 0.8370879772350012\n", + "Recall: 0.6511713705958311\n", + "F1 Score: 0.7325171197343846\n" + ] + } + ], + "source": [ + "from sklearn.datasets import fetch_openml\n", + "mnist = fetch_openml('mnist_784', version=1)\n", + "mnist.keys()\n", + "\n", + "import matplotlib as mpl\n", + "import matplotlib.pyplot as plt\n", + "\n", + "X, y = mnist[\"data\"], mnist[\"target\"]\n", + "\n", + "some_digit = X[0]\n", + "some_digit_image = some_digit.reshape(28, 28)\n", + "plt.imshow(some_digit_image, cmap=\"binary\")\n", + "plt.axis(\"off\")\n", + "plt.show()\n", + "\n", + "import numpy as np\n", + "y = y.astype(np.uint8)\n", + "X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]\n", + "y_train_5 = (y_train == 5) # True for all 5s, False for all other digits\n", + "y_test_5 = (y_test == 5)\n", + "\n", + "from sklearn.linear_model import SGDClassifier\n", + "\n", + "sgd_clf = SGDClassifier(random_state=42)\n", + "sgd_clf.fit(X_train, y_train_5)\n", + "\n", + "from sklearn.model_selection import cross_val_score\n", + "cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring=\"accuracy\")\n", + "\n", + "from sklearn.model_selection import cross_val_predict\n", + "y_train_pred = 
cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)\n", + "\n", + "from sklearn.metrics import confusion_matrix\n", + "confusion_matrix(y_train_5, y_train_pred)\n", + "\n", + "from sklearn.metrics import precision_score, recall_score\n", + "print(\"Precision:\",precision_score(y_train_5, y_train_pred)) # == 3530 / (3530 + 687)\n", + "print(\"Recall:\", recall_score(y_train_5, y_train_pred)) # == 3530 / (3530 + 1891)\n", + "\n", + "from sklearn.metrics import f1_score\n", + "print(\"F1 Score:\",f1_score(y_train_5, y_train_pred))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can calculate the scores used to analyse the recall-precision trade-off. This is done by setting the method to \"decision_function\". The scores can then be used to calculate the precisions and recalls for the range of thresholds used by the classifier." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,\n", + "method=\"decision_function\")\n", + "\n", + "from sklearn.metrics import precision_recall_curve\n", + "precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can plot the recalls and precisions against the threshold values to understand the nature of the trade-off. We will define a function that will plot the two data series. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "<Figure size 576x288 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):\n", + " plt.plot(thresholds, precisions[:-1], \"b--\", label=\"Precision\", linewidth=2)\n", + " plt.plot(thresholds, recalls[:-1], \"g-\", label=\"Recall\", linewidth=2)\n", + " plt.xlabel(\"Threshold\", fontsize=16)\n", + " plt.legend(loc=\"upper left\", fontsize=16)\n", + " plt.ylim([0, 1])\n", + "\n", + "plt.figure(figsize=(8, 4))\n", + "plot_precision_recall_vs_threshold(precisions, recalls, thresholds)\n", + "plt.xlim([-100000, 100000])\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Likewise, we can plot the precision against the recall. The area under this curve indicates the performance. A good classifier will have a curve that sits in the top-right corner.\n", + "\n", + "We have defined a function to plot the curve." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "<Figure size 576x432 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "def plot_precision_vs_recall(precisions, recalls):\n", + " plt.plot(recalls, precisions, \"b-\", linewidth=2)\n", + " plt.xlabel(\"Recall\", fontsize=16)\n", + " plt.ylabel(\"Precision\", fontsize=16)\n", + " plt.axis([0, 1, 0, 1])\n", + "\n", + "plt.figure(figsize=(8, 6))\n", + "plot_precision_vs_recall(precisions, recalls)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally we will plot the ROC curve. 
\n", + "\n", + "The ROC curve plots the *true positive rate* (TPR, another name for recall) against the *false positive rate* (FPR).\n", + "\n", + "Once again there is a trade-off: the higher the recall (TPR), the more false positives \n", + "(higher FPR) the classifier produces across the set of thresholds.\n", + "\n", + "We will define a function that plots the ROC curve." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "<Figure size 576x432 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "from sklearn.metrics import roc_curve\n", + "# Calculate the FPR and TPR, and retrieve the corresponding thresholds\n", + "fpr, tpr, thresholds = roc_curve(y_train_5, y_scores) \n", + "\n", + "def plot_roc_curve(fpr, tpr, label=None):\n", + " plt.plot(fpr, tpr, linewidth=2, label=label)\n", + " plt.plot([0, 1], [0, 1], 'k--')\n", + " plt.axis([0, 1, 0, 1])\n", + " plt.xlabel('False Positive Rate', fontsize=16)\n", + " plt.ylabel('True Positive Rate', fontsize=16)\n", + "\n", + "plt.figure(figsize=(8, 6))\n", + "plot_roc_curve(fpr, tpr)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can calculate the area under the curve (AUC) using Scikit-learn's `roc_auc_score` function."
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0.9604938554008616" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.metrics import roc_auc_score\n", + "roc_auc_score(y_train_5, y_scores)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/topic_20/.ipynb_checkpoints/Untitled-checkpoint.ipynb b/topic_20/.ipynb_checkpoints/Untitled-checkpoint.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..7fec51502cbc3200b3d0ffc6bbba1fe85e197f3d --- /dev/null +++ b/topic_20/.ipynb_checkpoints/Untitled-checkpoint.ipynb @@ -0,0 +1,6 @@ +{ + "cells": [], + "metadata": {}, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/topic_20/Untitled.ipynb b/topic_20/Untitled.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..f2cc0e29750742a503d6304ce00b17220514d1a5 --- /dev/null +++ b/topic_20/Untitled.ipynb @@ -0,0 +1,124 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAOcAAADnCAYAAADl9EEgAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAGaElEQVR4nO3dPUiWfR/G8dveSyprs2gOXHqhcAh6hZqsNRqiJoPKRYnAoTGorWyLpqhFcmgpEmqIIByKXiAHIaKhFrGghiJ81ucBr991Z/Z4XPr5jB6cXSfVtxP6c2rb9PT0P0CeJfN9A8DMxAmhxAmhxAmhxAmhljXZ/Vcu/H1tM33RkxNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCiRNCLZvvG+B//fr1q9y/fPnyVz9/aGio4fb9+/fy2vHx8XK/ceNGuQ8MDDTc7t69W167atWqcr948WK5X7p0qdzngycnhBInhBInhBInhBInhBInhBInhHLOOYMPHz6U+48fP8r92bNn5f706dOG29TUVHnt8PBwuc+nLVu2lPv58+fLfWRkpOG2du3a8tpt27aV+759+8o9kScnhBInhBInhBInhBInhBInhGqbnp6u9nJsVS9evCj3gwcPlvvffm0r1dKlS8v91q1b5d7e3j7rz960aVO5b9iwody3bt0668/+P2ib6YuenBBKnBBKnBBKnBBKnBBKnBBKnBBqUZ5zTk5Olnt3d3e5T0xMzOXtzKlm997sPPDx48cNtxUrVpTXLtbz3zngnBNaiTghlDghlDghlDghlDghlDgh1KL81pgbN24s96tXr5b7/fv3y33Hjh3l3tfXV+6V7du3l/vo6Gi5N3un8s2bNw23a9euldcytzw5IZQ4IZQ4IZQ4IZQ4IZQ4IZQ4IdSifJ/zT339+rXcm/24ut7e3obbzZs3y2tv375d7idOnCh3InmfE1qJOCGUOCGUOCGUOCGUOCGUOCHUonyf80+tW7fuj65fv379rK9tdg56/Pjxcl+yxL/HrcKfFIQSJ4QSJ4QSJ4QSJ4QSJ4Tyytg8+PbtW8Otp6envPbJkyfl/uDBg3I/fPhwuTMvvDIGrUScEEqcEEqcEEqcEEqcEEqcEMo5Z5iJiYly37lzZ7l3dHSU+4EDB8p9165dDbezZ8+W17a1zXhcR3POOaGViBNCiRNCiRNCiRNCiRNCiRNCOedsMSMjI+V++vTpcm/24wsrly9fLveTJ0+We2dn56w/e4FzzgmtRJwQSpwQSpwQSpwQSpwQSpwQyjnnAvP69ety7+/vL/fR0dFZf/aZM2fKfXBwsNw3b948689ucc45oZWIE0KJE0KJE0KJE0KJE0KJE0I551xkpqamyv3+/fsNt1OnTpXXNvm79M+hQ4fK/dGjR+W+gDnnhFYiTgglTgglTgglTgglTgjlKIV/beXKleX+8+fPcl++fHm5P3z4sOG2f//+8toW5ygFWok4IZQ4IZQ4IZQ4IZQ4IZQ4IdSy+b4B5tarV6/KfXh4uNzHxsYabs3OMZvp6uoq97179/7Rr7/QeHJCKHFCKHFCKHFCKHFCKHFCKHFCKOecYcbHx8v9+vXr5X7v3r1y//Tp02/f07+1bFn916mzs7PclyzxrPhvfjcglDghlDghlDghlDghlDghlDghlHPOv6DZWeKdO3cabkNDQ+W179+/n80tzYndu3eX++DgYLkfPXp0Lm9nwfPkhFDihFDihFDihFDihFDihFCOUmbw+fPncn/79m25nzt3rtzfvXv32/c0V7q7u8v9woULDbdjx46V13rla2753YRQ4oRQ4oRQ4oRQ4oRQ4oRQ4oRQC/acc3JysuHW29tbXvvy5ctyn5iYmNU
9zYU9e/aUe39/f7kfOXKk3FevXv3b98Tf4ckJocQJocQJocQJocQJocQJocQJoWLPOZ8/f17uV65cKfexsbGG28ePH2d1T3NlzZo1Dbe+vr7y2mbffrK9vX1W90QeT04IJU4IJU4IJU4IJU4IJU4IJU4IFXvOOTIy8kf7n+jq6ir3np6ecl+6dGm5DwwMNNw6OjrKa1k8PDkhlDghlDghlDghlDghlDghlDghVNv09HS1lyMwJ9pm+qInJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4QSJ4Rq9iMAZ/yWfcDf58kJocQJocQJocQJocQJocQJof4DO14Dhyk10VwAAAAASUVORK5CYII=\n", + "text/plain": [ + "<Figure size 432x288 with 1 Axes>" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "from sklearn.datasets import fetch_openml\n", + "mnist = fetch_openml('mnist_784', version=1)\n", + "mnist.keys()\n", + "\n", + "import matplotlib as mpl\n", + "import matplotlib.pyplot as plt\n", + "\n", + "X, y = mnist[\"data\"], mnist[\"target\"]\n", + "\n", + "some_digit = X[0]\n", + "some_digit_image = some_digit.reshape(28, 28)\n", + "plt.imshow(some_digit_image, cmap=\"binary\")\n", + "plt.axis(\"off\")\n", + "plt.show()\n", + "\n", + "import numpy as np\n", + "y = y.astype(np.uint8)\n", + "X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([3], dtype=uint8)" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sklearn.linear_model import SGDClassifier\n", + "sgd_clf = SGDClassifier(max_iter=5, tol=-np.infty, random_state=42)\n", + "sgd_clf.fit(X_train, y_train)\n", + "sgd_clf.predict([some_digit])" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[-31893.03095419, -34419.69069632, -9530.63950739,\n", + " 1823.73154031, -22320.14822878, -1385.80478895,\n", + " -26188.91070951, -16147.51323997, -4604.35491274,\n", + " -12050.767298 ]])" + ] + }, + "execution_count": 10, + 
"metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sgd_clf.decision_function([some_digit])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/topic_20/topic_20.md b/topic_20/topic_20.md new file mode 100644 index 0000000000000000000000000000000000000000..a25930c9dde856ceab74e79e4af029b03254554c --- /dev/null +++ b/topic_20/topic_20.md @@ -0,0 +1,63 @@ +class: bottom, left +background-image: url(assets/g.png) + +<h2 class="title_headings_sml">COSC102 - Data Science Studio 1</h2> + +<h1 class="title_headings_sml"> Topic 20 - Multiclass Classification </h1> + +<h3 class="title_headings_sml"> Dr. Mitchell Welch </h3> + +--- + + +## Reading + +* Chapter 3 from ***Hands-on Machine Learning with Scikit-Learn & TensorFlow*** + +--- + +## Summary + +* Multiclass Classification +* Coded Demonstrations + +--- + +## Multiclass Classification + +* So far we have only looked at a binary problem: building a "5" detector for the MNIST dataset. +* We will expand on this with a full multiclass classifier. +* Recall that we have already divided the dataset and created the full multiclass dataset for training. +* This make things very easy: + +```python +sgd_clf.fit(X_train, y_train) +sgd_clf.predict([some_digit]) +``` + +--- + +## Multiclass Classification + +* Some machine learning algorithms are capable of multiclass classification without the need for additional processing. (e.g. 
Random Forest, Naive Bayes) +* Others are strictly binary classifiers that require additional strategies to extend them to the multiclass problem. + * One-vs-rest - train one binary classifier for each class. + * One-vs-one - train one binary classifier for each distinct pair of classes. +* At prediction time, the outputs of these binary classifiers are combined to choose the final class. + +--- + +## Coded Demonstrations + +* To demonstrate the use of multiclass classification, we are going to review two coded examples using the *iris* dataset. + * [Example 1]() + * [Example 2]() + +--- + +## Summary + +* Multiclass Classification +* Coded Demonstrations + +--- \ No newline at end of file