"## Random Forests - Classification and Regression\n",
"* Random forests is a supervised learning algorithm. \n",
"* It can be used both for classification and regression problems.\n",
"* Random forests fits a number of decision tree classifiers on randomly selected sub-samples of the dataset.\n",
"* It obtains a prediction from each tree, then uses voting to return an optimal result\n",
"* It uses averaging to improve the predictive accuracy and minimise over-fitting.\n",
"* Increasing the number of trees improves robustness\n",
"* Random forests uses randomly selected subsets of the training data to create decorrelated decision trees. This reduces variances, and improves robustness of the model, by ensuring the model isn't entirely dependent on any given strong predictor. \n",
"* Random forests is good for applications such as recommendation engines, image classification, predict patient outcomes based on symptoms, etc.\n",
"* If you were seeking recommendations to buy a product online, e.g. an air fryer, you would search and read online reviews. This can be compared to the decison tree part of the algorithm.\n",
"* You would then tally the votes for the most recommended products and base your decision on the majority vote. "
"### Classification Problem: Predict Whether a Credit Card Transaction is Fraudulent\n",
"Let's predict whether or not a transaction is fraudulent. The dataset used in this exercise can be found here: https://www.kaggle.com/mlg-ulb/creditcardfraud"
"Because this is a real life dataset, the features used to predict whether a transaction is or is not fraudulent have been relabelled to protect the privacy of users. For more information on features, visit this website: https://www.kaggle.com/mlg-ulb/creditcardfraud"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\Andreas Shepley\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access\n",
" \"\"\"Entry point for launching an IPython kernel.\n",
"C:\\Users\\Andreas Shepley\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py:2: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access\n",
"Go to https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html for more information on the parameters you can pass into your random forest classifier. The two most important parameters are n_estimators and max_features.\n",
"* n_estimators is the number of trees in the forest. The default is 100 (int). Increasing the number of trees improves accuracy by improving the model's ability to generalise.\n",
"* max_features is the maximum number of features to consider when looking for the best split. Possible values include:{“auto”, “sqrt”, “log2”}, int or float, default=”auto”. \"auto\" is the same as \"sqrt\". \"sqrt\" means the maximum number of features will be equal to the square root of the number of features in the training set. If max_features== numbber of features in the dataset, you will end up with a bagged decision tree model, not a random forest. If max_features< number of features you will create a random forest. "
"plt.title(\"Visualizing Important Features\")\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Advantages of using Random Forests\n",
"* It decorrelates features, even if the dataset has many highly correlated features, thus removing bias in the prediction.\n",
"* It can reduce error because it relies on a collection of decision trees, which means that it will use many trees to predict the outcome of one sample, minimising error and variance.\n",
"* It performs well on unbalanced datasets, e.g. when one class is is dominant, e.g >98% of transactions will be genuine, with <2% fraudulent\n",
"* Minimisation of impact of outliers on predictions because many trees are used\n",
"* Very good at generalisation, with minmal overfitting because only a random subset of features are used per tree, with many trees being used per prediction.\n",
"\n",
"### Exercise: \n",
"Try different combinations of trainig features to see if you can improve the accuracy of the model. Which features are the most important?"
]
}
%% Cell type:markdown id: tags:
## Random Forests - Classification and Regression
* Random forests is a supervised learning algorithm.
* It can be used for both classification and regression problems.
* A random forest fits a number of decision trees on randomly selected sub-samples of the dataset.
* It obtains a prediction from each tree, then combines them: majority voting for classification and averaging for regression (see the short sketch after this list).
* Aggregating many trees improves predictive accuracy and minimises over-fitting.
* Increasing the number of trees improves robustness.
* Random forests uses randomly selected subsets of the training data (and of the features) to create decorrelated decision trees. This reduces variance and improves the robustness of the model by ensuring it isn't entirely dependent on any single strong predictor.
* Random forests is well suited to applications such as recommendation engines, image classification, and predicting patient outcomes from symptoms.
* If you were deciding which product to buy online, e.g. an air fryer, you might search for and read many online reviews. Each review is like an individual decision tree in the forest.
* You would then tally the recommendations and base your decision on the majority vote, just as the forest does.
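%% Cell type:markdown id: tags:
A minimal sketch of the voting idea, on a toy dataset (the dataset, the number of trees and the random seeds are illustrative choices). Note that scikit-learn's RandomForestClassifier actually averages the trees' predicted class probabilities rather than counting hard votes, but for most samples the result matches the simple majority vote shown here.
%% Cell type:code id: tags:
``` python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy classification data, just to show how the forest combines its trees
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# Ask each individual tree for its prediction on one sample, then compare
# with the prediction of the forest as a whole
sample = X[:1]
tree_votes = [int(tree.predict(sample)[0]) for tree in forest.estimators_]
print("Individual tree votes:", tree_votes)
print("Forest prediction:    ", int(forest.predict(sample)[0]))
```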
print("NaN by column:\n",dataframe.isnull().sum())
return
deffillNaN_other(dataframe,key):
dataframe[key].fillna('Other',inplace=True)
return
```
%% Cell type:markdown id: tags:
### Classification Problem: Predict Whether a Credit Card Transaction is Fraudulent
Let's predict whether or not a transaction is fraudulent. The dataset used in this exercise can be found here: https://www.kaggle.com/mlg-ulb/creditcardfraud
Because this is a real-life dataset, the features used to predict whether a transaction is or is not fraudulent have been anonymised and relabelled to protect the privacy of users. For more information on the features, visit this website: https://www.kaggle.com/mlg-ulb/creditcardfraud
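%% Cell type:markdown id: tags:
Before selecting the training features, load the dataset into a pandas DataFrame. A minimal sketch, assuming the Kaggle CSV has been downloaded and saved as creditcard.csv in the working directory (the file name and path are assumptions):
%% Cell type:code id: tags:
``` python
import pandas as pd

# Load the Kaggle credit card fraud dataset (creditcard.csv is an assumed local path)
data = pd.read_csv("creditcard.csv")

# Columns include Time, V1-V28, Amount and the Class label (1 = fraudulent)
print(data.shape)
data.head()
```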
%% Cell type:code id: tags:
``` python
# Select training features and target; plain variables avoid the pandas
# warning about creating columns via attribute access (data.features = ...)
features = data[["Amount", "V28", "V27", "V26"]]
target = data["Class"]
```
%% Cell type:markdown id: tags:
Go to https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html for more information on the parameters you can pass into your random forest classifier. The two most important parameters are n_estimators and max_features (the fitting sketch after this list shows them in use).
* n_estimators is the number of trees in the forest. The default is 100 (int). Increasing the number of trees tends to improve accuracy by improving the model's ability to generalise.
* max_features is the maximum number of features to consider when looking for the best split. Possible values include {"auto", "sqrt", "log2"}, an int or a float; the default is "auto". "auto" is the same as "sqrt", meaning the maximum number of features considered at each split equals the square root of the number of features in the training set. If max_features equals the number of features in the dataset, you end up with a bagged decision tree model, not a random forest. If max_features is less than the number of features, you create a random forest.
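%% Cell type:markdown id: tags:
As a concrete illustration of these two parameters, here is a minimal fitting sketch. It assumes the `features` and `target` variables defined above; the train/test split proportions, `random_state` values and the choice of `max_features="sqrt"` are illustrative rather than prescribed. Note that on a dataset this unbalanced, plain accuracy can look deceptively high, so a confusion matrix or classification report gives a clearer picture.
%% Cell type:code id: tags:
``` python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out a test set (the 30% split and random_state=42 are illustrative choices)
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=42)

# n_estimators: number of trees; max_features: features considered at each split
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
```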
%% Cell type:markdown id: tags:
### Advantages of using Random Forests
* It decorrelates the individual trees, even if the dataset has many highly correlated features, so the prediction is not dominated by any single feature.
* It can reduce error because it relies on a collection of decision trees: many trees contribute to the prediction for each sample, minimising error and variance.
* It performs well on unbalanced datasets, e.g. when one class is dominant, such as >98% of transactions being genuine and <2% fraudulent.
* It minimises the impact of outliers on predictions because many trees are used.
* It generalises well, with minimal overfitting, because only a random subset of features is used per tree and many trees contribute to each prediction.
### Exercise:
Try different combinations of training features to see if you can improve the accuracy of the model. Which features are the most important? The feature-importance sketch below shows one way to check.
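%% Cell type:markdown id: tags:
The notebook's "Visualizing Important Features" plot can be reconstructed along the following lines. This sketch assumes the fitted classifier `clf` and the `features` DataFrame from the cells above, and that the figure was a bar chart of `feature_importances_` (an assumption about how the original plot was produced).
%% Cell type:code id: tags:
``` python
import pandas as pd
import matplotlib.pyplot as plt

# feature_importances_ is available on the classifier once it has been fitted
importances = pd.Series(clf.feature_importances_, index=features.columns).sort_values()

plt.barh(importances.index, importances.values, label="feature importance")
plt.title("Visualizing Important Features")
plt.legend()
plt.show()
```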