Supervised Learning - Decision Trees, Random Forest and eXtreme Gradient Boosting (XGBoost)

Supervised Learning - Decision Trees, Random Forest and eXtreme Gradient Boosting (XGBoost) #

What is a Decision Tree?#

Main Idea: organize the feature space in as a collection of units (in Euclidean space, rectangular boxes) such that all observations in each unit have some homogeneity (similar range of one or more features). As such, every observation goes into some unit and the predicted value is assigned by computing the mean (regression) or the votes (classification) over the entire unit.
General View: the input features can be a mix of discrete and continuous variables and the dependent variable can be either continuous (for regression problems) or discrete (for classification problems).
Important: This is an iterative process. The decisions are nodes in the tree, from each node we split two branches. The decisions are informed by objective functions (e.g. we can think of mean squared error for regression or the Gini impurity for classification). The final decisions are represented as piecewise defined, step functions.

Example: predict the mileage of the car by using its weight in 1000 lbs. (MTCars data set).

MPG prediction based on the weight of the car.

The following is a visualization in the feature-vs-target space:

Example: predict the mileage of the car by using weight and HP. (MTCars data set).

In a two-dimensional feature space, below we have a visualization example of the compartmentalization by decision trees:

Visualization of the step function created with decision trees.

Decision Trees for Regression#

Measuring Loss: for the objective function we can use either the mean absolute error or the mean square error.

Algorithm:

Decision Trees for Classification#

Example: A biologist recorded the weights of rabbits and squirrels in a data set. A decision tree starts with the root node such as if the weight is less than a threshold value, then inside the resulting subset (branch) there are more squirrels than rabbits.

Measuring Loss: The Gini impurity index measures the likelihood of missclassification in each node:

\[\large G:= \sum_{i=1}^{C} p(i)\cdot(1-p(i))= 1 - \sum_{i=1}^{C} p^2(i)\]

where \(C\) is the number of distinct classes and \(p(i)\) represents the probability of choosing and element of class \(i\) if one individual is randomly selected in the node.

Example of a Decision Tree for Classification.

For a binary classification problem, we can think of approximating two distinct probability distributions:

Random Forests#

Original Abstract (L. Breiman, 2001): Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges almost surely to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Freund and Schapire[1996]), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

Main Idea: grow many decision trees based on random samples of set of observations and then average the predictions across the trees in the forest for the final answer. One important aspect is that the random sampling is done with replacement (“bootstrap aggregation”).

Decision Boundaries Comparison#

An example of decision boundaries for Decsion Trees vs Random Forest.

Boosting#

Boosting (Schapire and Freund 2012) is a greedy algorithm for fitting adaptive basis-function where the weights are generated by an algorithm called a weak learner or a base learner. The algorithm works by applying the weak learner sequentially to weighted versions of the data, where more weight is given to examples that were misclassified by earlier rounds. This weak learner can be any classification or regression algorithm. In 1998, the late Leo Breiman called boosting, where the weak learner is a shallow decision tree, the “best off-the-shelf classifier in the world” (Hastie et al. 2009, p340). (Reference: K. Murphy - “Machine Learning - A Probilistic Perspective”, page 554.)

How Does Adaptive Boosting Work?#

We can understand the working of the AdaBoost algorithm in step by step manner as going deep into the work, we can see there are multiple basic steps which this algorithm follows. Let’s take a look at these steps.

1. When the algorithm is given data, it starts by assigning equal weights to all training examples in the dataset. These weights represent the importance of each sample during the training process.

2. Here, this algorithm iterates with a few algorithms for a specified number of iterations (or until a stopping criterion is met). The algorithm trains a weak classifier on the training data. Here the weak classifier can be considered a model that performs slightly better than random guessing, such as a decision stump (a one-level decision tree).

3. During each iteration, the algorithm trains the weak classifier on given training data with the current sample weights. The weak classifier aims to minimize the classification error, weighted by the sample weights.

4. After training the weak classifier, the algorithm calculates classifier weight based on the errors of the weak classifier. A weak classifier with a lower error receives a higher weight.

5. Once the calculation of weight completes, the algorithm updates sample weights, and the algorithm gives assigns higher weights to misclassified examples so that more importance in subsequent iterations can be given.

6. After updating the sample weights, they are normalized so that they sum up to 1 and Combine the predictions of all weak classifiers using a weighted majority vote. The weights of the weak classifiers are considered when making the final prediction.

7. Finally, Steps 2–5 are repeated for the specified number of iterations (or until the stopping criterion is met), with the sample weights updated at each iteration. The final prediction is obtained by aggregating the predictions of all weak classifiers based on their weights.

The below pseudocode can be helpful in understanding the working of the AdaBoost algorithm.

Initialize sample weights for each training example
For each iteration:
- Train a weak classifier using the current sample weights
- Calculate the error of the weak classifier
- Calculate the weight of the weak classifier based on the error
- Update the sample weights based on the weak classifier's performance
- Normalize the sample weights
End the iterations
Combine the weak classifiers using a weighted majority vote.

Reference: https://medium.com/@datasciencewizards/understanding-the-adaboost-algorithm-2e9344d83d9b

Gradient Boosting#

Assume you have an regressor \(F\) and, for the observation \(x_i\) we make the prediction \(F(x_i)\). To improve the predictions, we can regard \(F\) as a ‘weak learner’ and therefore train a decision tree (we can call it \(h\)) where the new output is \(y_i-F(x_i)\). So, the new predictor is trained on the residuals of the previous one. Thus, there are increased chances that the new regressor

\[\large F + h\]

is better than the old one, \(F.\)

Main task: implement this idea in an algorithm and test it on real data sets.

Gradient Boosting w/Two Steps for Regression

eXtreme Gradient Boosting (XGBoost)#

XGBoost is a powerful and widely used machine learning algorithm. It’s a specific implementation of gradient boosting, a technique that builds an ensemble of predictive models (typically decision trees) in a sequential manner. Each new model focuses on correcting the errors of the previous ones, leading to a final model that is often significantly more accurate than any of its individual components.

How Does XGBoost Work?

Initialization: XGBoost begins by creating a simple initial model, usually a single decision tree, that makes predictions based on the input features. This initial model will likely have errors.
Gradient Calculation: XGBoost then calculates the gradients of the loss function (a measure of how well the model is performing) with respect to the predictions of the current model. These gradients indicate the direction and magnitude of changes needed to improve the predictions.
New Tree Creation: A new decision tree is trained to predict these gradients. This tree essentially learns how to correct the errors of the previous model.
Tree Weighting: The new tree is added to the ensemble with a weight that determines how much influence it has on the final predictions. This weight is typically a small value to avoid overfitting.
Iteration: Steps 2-4 are repeated for a specified number of iterations or until the model’s performance stops improving. With each iteration, new trees are added to the ensemble, each focusing on correcting the residual errors of the previous trees.
Final Prediction: The final prediction is made by summing the weighted predictions of all the trees in the ensemble.

Key Features and Advantages of XGBoost:

Regularization: XGBoost includes regularization techniques (L1 and L2 regularization) to prevent overfitting, making it more robust to noise and outliers in the data.
Handling Missing Values: XGBoost can automatically handle missing values in the input data without requiring explicit imputation.
Parallelization: The algorithm can be parallelized across multiple cores or machines, making it scalable to large datasets.
Tree Pruning: XGBoost prunes trees during training to remove branches that don’t contribute significantly to the model’s accuracy, further preventing overfitting.
Feature Importance: It provides a measure of feature importance, allowing you to understand which features are most influential in making predictions. Where is XGBoost Used?

XGBoost is extremely versatile and is used in a wide range of applications, including:

Classification: Predicting categorical outcomes (e.g., customer churn, fraud detection, disease diagnosis) Regression: Predicting continuous outcomes (e.g., house prices, stock prices, sales forecasting) Ranking: Ranking items in order of relevance (e.g., search engine results, product recommendations) Important Considerations:

XGBoost can be sensitive to hyperparameter tuning, so careful optimization is often required. It may be computationally expensive for very large datasets.

Code Applications#

Setup#

import os
if 'google.colab' in str(get_ipython()):
  print('Running on CoLab')
  from google.colab import drive
  drive.mount('/content/drive')
  os.chdir('/content/drive/My Drive/Data Sets')
  !pip install -q pygam
  !pip install -q dtreeviz
else:
  print('Running locally')
  os.chdir('../Data')

Running on CoLab
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 91.8/91.8 kB 1.6 MB/s eta 0:00:00
?25h

# import libraries
import numpy as np
import pandas as pd
import pydot
from IPython.display import Image

from xgboost import XGBClassifier

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn import tree
import dtreeviz

Data import#

data_example_1 = pd.read_csv('example_data_classification.csv', header=None)
data_fusion_experiment = pd.read_csv('fusion_experiment.csv')

Example 1#

data_example_1.columns = ['Exam 1','Exam 2','Status']

data_example_1

	Exam 1	Exam 2	Status
0	34.623660	78.024693	0
1	30.286711	43.894998	0
2	35.847409	72.902198	0
3	60.182599	86.308552	1
4	79.032736	75.344376	1
...	...	...	...
95	83.489163	48.380286	1
96	42.261701	87.103851	1
97	99.315009	68.775409	1
98	55.340018	64.931938	1
99	74.775893	89.529813	1

100 rows × 3 columns

x = data_example_1[['Exam 1','Exam 2']]
y = data_example_1['Status']

model = tree.DecisionTreeClassifier(max_depth=4,min_samples_leaf=5,random_state=123)
model.fit(x,y)

DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=123)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

plt.figure(figsize=(8,8))
tree.plot_tree(model,feature_names = ['Grades E1','Grades E2'])
plt.savefig('DecisionTree_Example.svg', bbox_inches='tight')
plt.show()

_images/1baeca31820128b143e58018a1e4892d83b9cc89452d18a3946cf89a4276eb6e.png

def ShowTree(classifier, features, classes):
    dot_data = tree.export_graphviz(model, out_file=None, filled=True, rounded=True,
                special_characters=True, feature_names=x.columns, class_names=classes)
    (g,) = pydot.graph_from_dot_data(dot_data)
    #g.set_dpi('100')
    g.write_png('tree.png')
    return Image(g.create_png())

ShowTree(model, x.columns,['Not Admitted','Admitted'])

_images/d0f7eabcc671dce283db6456d63f08fcddcb76f608446f12066fe3d5eb290142.png

h = .1 # step size in the grid of points
cmap_light = ListedColormap(['#FFD0D7', 'lightcyan'])
cmap_bold = ListedColormap(['red', 'navy'])

# create a grid of values for the features' space
x_min, x_max = x.values[:, 0].min() - 1, x.values[:, 0].max() + 1
y_min, y_max = x.values[:, 1].min() - 1, x.values[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

Example of Decision Boundary from a Binary Tree#

# filter out the applicants that got admitted
admitted = data_example_1.loc[y == 1]

# filter out the applicants that din't get admission
not_admitted = data_example_1.loc[y == 0]


# show the decision boundary in the features' space
Z = model.predict(pd.DataFrame(np.c_[xx.ravel(), yy.ravel()],columns=['Exam 1','Exam 2']))

# Put the result into a color plot
Z = Z.reshape(xx.shape)
fig, ax = plt.subplots()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)


# Plot also the points
plt.scatter(admitted.iloc[:, 0], admitted.iloc[:, 1], color ='deepskyblue', s=25, label='Admitted',ec='k',alpha=0.5)
plt.scatter(not_admitted.iloc[:, 0] ,not_admitted.iloc[:, 1], color ='red', s=25,ec='k',alpha=0.5, label='Not Admitted')
plt.xlim(xx.min(), xx.max())
plt.xlabel('Grades Exam 1')
plt.ylabel('Grades Exam 2')
plt.ylim(yy.min(), yy.max())
ax.set_aspect('equal', 'box')
plt.title("Decision Tree - max_depth=4")
print('Accuracy : ' + str(accuracy_score(y,model.predict(x))))
plt.savefig('dtree.png',dpi=300)
plt.show()

Accuracy : 0.94

_images/26f135186e335424fdcb378c59903176fea12964eed1be6ef64221ae8ba54f79.png

Example of Decision Boundary from XGBoost#

model = XGBClassifier(n_estimators=10, max_depth = 4,random_state=123)
model.fit(x,y)

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=4, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=10, n_jobs=None,
              num_parallel_tree=None, random_state=123, ...)