%% Cell type:markdown id: tags:
## Random Forests - Classification and Regression
* Random forests is a supervised learning algorithm.
* It can be used for both classification and regression problems.
* Random forests fits a number of decision tree classifiers on randomly selected sub-samples of the dataset.
* For classification, it obtains a prediction from each tree and uses majority voting to return the final result.
* For regression, it averages the trees' outputs, which improves predictive accuracy and minimises over-fitting.
* Increasing the number of trees improves robustness.
* Random forests uses randomly selected subsets of the training data to create decorrelated decision trees. This reduces variance and improves the robustness of the model by ensuring it isn't entirely dependent on any single strong predictor.
* Random forests is well suited to applications such as recommendation engines, image classification, and predicting patient outcomes from symptoms.
* If you were seeking recommendations to buy a product online, e.g. an air fryer, you would search and read online reviews. This can be compared to the decision tree part of the algorithm.
* You would then tally the votes for the most recommended products and base your decision on the majority vote, as in the sketch below.
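The voting step can be sketched in a few lines (the "votes" below are made up for illustration and are not part of the credit card dataset used later):
%% Cell type:code id: tags:
``` python
# Hypothetical example: each "tree" casts a vote for a class label,
# and the forest's prediction is simply the majority vote.
from collections import Counter

tree_votes = ["fraudulent", "genuine", "genuine", "genuine", "fraudulent"]
majority_vote = Counter(tree_votes).most_common(1)[0][0]
print(majority_vote)  # genuine
```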
%% Cell type:code id: tags:
``` python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
def check_NaN(dataframe):
    # Report missing values, both in total and per column
    print("Total NaN:", dataframe.isnull().values.sum())
    print("NaN by column:\n", dataframe.isnull().sum())

def fillNaN_other(dataframe, key):
    # Replace missing values in the given column with the label 'Other'
    dataframe[key].fillna('Other', inplace=True)
```
%% Cell type:markdown id: tags:
### Classification Problem: Predict Whether a Credit Card Transaction is Fraudulent
Let's predict whether or not a transaction is fraudulent. The dataset used in this exercise can be found here: https://www.kaggle.com/mlg-ulb/creditcardfraud
%% Cell type:code id: tags:
``` python
data = pd.read_csv("./datasets/creditcard.csv")
```
%% Cell type:code id: tags:
``` python
data.head()
```
%% Output
Time V1 V2 V3 V4 V5 V6 V7 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941
V8 V9 ... V21 V22 V23 V24 \
0 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928
1 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846
2 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281
3 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575
4 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267
V25 V26 V27 V28 Amount Class
0 0.128539 -0.189115 0.133558 -0.021053 149.62 0
1 0.167170 0.125895 -0.008983 0.014724 2.69 0
2 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0
3 0.647376 -0.221929 0.062723 0.061458 123.50 0
4 -0.206010 0.502292 0.219422 0.215153 69.99 0
[5 rows x 31 columns]
%% Cell type:code id: tags:
``` python
check_NaN(data)
```
%% Output
Total NaN: 0
NaN by column:
Time 0
V1 0
V2 0
V3 0
V4 0
V5 0
V6 0
V7 0
V8 0
V9 0
V10 0
V11 0
V12 0
V13 0
V14 0
V15 0
V16 0
V17 0
V18 0
V19 0
V20 0
V21 0
V22 0
V23 0
V24 0
V25 0
V26 0
V27 0
V28 0
Amount 0
Class 0
dtype: int64
%% Cell type:markdown id: tags:
#### Feature Selection
Because this is a real life dataset, the features used to predict whether a transaction is or is not fraudulent have been relabelled to protect the privacy of users. For more information on features, visit this website: https://www.kaggle.com/mlg-ulb/creditcardfraud
%% Cell type:code id: tags:
``` python
# Select a subset of columns as the training features and the Class column as the target.
# Plain variables are used here because assigning new attributes on a DataFrame
# (e.g. data.features = ...) triggers a pandas UserWarning and does not create columns.
features = data[["Amount", "V28", "V27", "V26"]]
target = data.Class
```
%% Cell type:code id: tags:
``` python
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.2)
```
%% Cell type:markdown id: tags:
Go to https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html for more information on the parameters you can pass into your random forest classifier. The two most important parameters are n_estimators and max_features.
* n_estimators is the number of trees in the forest. The default is 100 (int). Increasing the number of trees generally improves accuracy and the model's ability to generalise, at the cost of training time.
* max_features is the maximum number of features to consider when looking for the best split. Possible values include {“auto”, “sqrt”, “log2”}, int or float, default=”auto”. "auto" is the same as "sqrt". "sqrt" means the number of features considered at each split equals the square root of the number of features in the training set. If max_features equals the number of features in the dataset, you end up with a bagged decision tree model, not a random forest. If max_features is less than the number of features, you create a random forest.
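As a rough sketch of how these settings can be compared (the values below are arbitrary, and cross-validating on the full dataset may take a while), you could score a few configurations with cross_val_score on the training split:
%% Cell type:code id: tags:
``` python
# Illustrative sketch only: compare a few n_estimators settings with
# 3-fold cross-validation on the training split. The values are arbitrary.
from sklearn.model_selection import cross_val_score

for n_trees in (10, 50, 100):
    clf = RandomForestClassifier(n_estimators=n_trees, max_features='sqrt')
    scores = cross_val_score(clf, x_train, y_train, cv=3)
    print("{} trees -> mean accuracy: {:.4f}".format(n_trees, scores.mean()))
```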
%% Cell type:code id: tags:
``` python
model = RandomForestClassifier(n_estimators=100, max_features='sqrt')
fitted_model = model.fit(x_train, y_train)
predictions = fitted_model.predict(x_test)
```
%% Cell type:code id: tags:
``` python
print(confusion_matrix(y_test, predictions))
```
%% Output
[[56858 2]
[ 82 20]]
%% Cell type:code id: tags:
``` python
print("Accuracy: {:0.2f}%".format(accuracy_score(y_test, predictions)*100))
```
%% Output
Accuracy: 99.85%
%% Cell type:markdown id: tags:
### Visualisation of the importance of variables in classifying a transaction as fraudulent or not
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Rank the features by the importance scores learned by the fitted forest
feature_imp = pd.Series(model.feature_importances_, index=x_train.columns).sort_values(ascending=False)
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()
```
%% Cell type:markdown id: tags:
### Advantages of using Random Forests
* It decorrelates the individual trees, even if the dataset has many highly correlated features, which reduces bias in the prediction.
* It can reduce error because it relies on a collection of decision trees: many trees contribute to the prediction for each sample, which minimises error and variance.
* It performs well on unbalanced datasets, e.g. when one class is dominant, such as here where >98% of transactions are genuine and <2% are fraudulent (see the per-class check after this list).
* The impact of outliers on predictions is minimised because many trees are used.
* It generalises very well, with minimal over-fitting, because only a random subset of features is used per tree and many trees contribute to each prediction.
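On a dataset this unbalanced, overall accuracy alone can hide weak recall on the rare fraud class. A minimal way to check per-class precision and recall on the predictions made above is scikit-learn's classification_report:
%% Cell type:code id: tags:
``` python
# Per-class precision and recall for the predictions made earlier.
# Class 0 = genuine, class 1 = fraudulent.
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))
```
%% Cell type:markdown id: tags: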
### Exercise:
Try different combinations of training features to see if you can improve the accuracy of the model. Which features are the most important? A possible starting point (with arbitrary example feature subsets) is sketched below.
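%% Cell type:code id: tags:
``` python
# Possible starting point for the exercise: train and score the model on a few
# different feature subsets. The subsets below are arbitrary examples.
subsets = [
    ["Amount", "V28", "V27", "V26"],
    ["Amount", "V1", "V2", "V3"],
    ["V10", "V12", "V14", "V17"],
]
for cols in subsets:
    Xtr, Xte, ytr, yte = train_test_split(data[cols], data.Class, test_size=0.2)
    clf = RandomForestClassifier(n_estimators=100, max_features='sqrt').fit(Xtr, ytr)
    print(cols, "-> accuracy: {:0.4f}".format(accuracy_score(yte, clf.predict(Xte))))
```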