by Dr Andy Corbett
Gradient Boosting
12. Gradient Boosting by Hand: Code Example
Gradient Boosting In Forests
In this lesson we take to our computers and implement a worked example of the gradient boosting algorithm. For what purpose? To dress up the theory we just learnt and see first-hand the impact of learning residuals from previous estimates.
Following the example of the last lecture, we implement the algorithm in code to get a feel for the Gradient Boosting process before pulling out the heavy machinery of XGBoost.
- Revisit the algorithm described in theory in the last lecture.
- Implement the algorithm from scratch on a worked example.
- Compare with the implementation of a random forest.
Coding a forest of tree stumps
Revisiting the example from the last lesson, we set about solving a supervised learning problem using Gradient Boosted Trees and comparing the algorithm with that of the Random Forest.
The data, as before, is given by a simple quadratic curve, $y = x^2$. Our training data shall consist of 10 points on this curve, and our goal is to interpolate between these points with 100 test points. Let's set up the example.
import numpy as np
NUM_STUMPS = 4 # Number of stumps in each model
np.random.seed(31) # Fix seed to reproduce example
def demo(x):
    """Our ground-truth function."""
    return x ** 2
# Training data
x_train = np.linspace(-1, 1, 10)
y_train = demo(x_train)
# Testing points
x_test = np.linspace(-1, 1, 100)
The training target is then the following list of numbers.
y_train = [1.00, 0.60, 0.31, 0.11, 0.01, 0.01, 0.11, 0.31, 0.60, 1.00]
In fact, for scikit-learn models, we should structure one-dimensional data as
x_train = np.linspace(-1, 1, 10).reshape(-1, 1)
so that it has shape (num_samples, num_features), since scikit-learn's fit and predict methods expect a two-dimensional feature array. This is why we call .reshape(-1, 1) before fitting and predicting in the code below.
A random forest of stumps
First, let's construct our baseline model for comparison: a random forest of stumps. We'll train each stump in the forest on bagged data (recall this definition); in contrast, the gradient-boosted stumps will be trained on the entire data set.
from sklearn.tree import DecisionTreeRegressor

BAG_NUM = int(0.6 * len(x_train))  # We'll bag 60% of the training set

# List to store stump models
stumps = list()

for s in range(NUM_STUMPS):
    x_sample = np.random.choice(x_train, BAG_NUM).reshape(-1, 1)
    # For a stump, we only want to split once
    dt = DecisionTreeRegressor(max_depth=1)
    dt.fit(x_sample, demo(x_sample))
    # Append to model list
    stumps.append(dt)
This leaves us with a list of models in stumps, each of which makes predictions via a single split. To form the aggregate model (the random forest), since this is a regression problem, we take the mean of the predictions.
# Average the stump predictions point-wise to form the forest prediction
mean_pred = np.mean([dt.predict(x_test.reshape(-1, 1)) for dt in stumps], axis=0)
To reiterate, each dt in stumps is a decision tree with a single decision node. That means each one splits the data set into two and assigns a value to each rectangular region by averaging the training points that fall there. Let's see what that looks like.
Figure 1. Using a collection of shallow trees, 'stumps', to form a random forest does not give enough flexibility to solve even simple problems.
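Before computing errors, it can help to peek inside one of these fitted stumps. The following snippet is an illustrative addition (not part of the original example) that uses scikit-learn's tree_ attribute to print the first stump's split threshold and its two leaf values.
# Illustrative: inspect the first stump's single split and its two leaf values
dt = stumps[0]
root = 0
left = dt.tree_.children_left[root]
right = dt.tree_.children_right[root]
print("Split at x <=", dt.tree_.threshold[root])
print("Left leaf predicts:", dt.tree_.value[left][0][0])
print("Right leaf predicts:", dt.tree_.value[right][0][0])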
Let's check the Mean-Squared Error (MSE) for each stump, as well as the forest.
from sklearn.metrics import mean_squared_error

mse_rf_stumps = list()
for dt in stumps:
    mse_rf_stumps.append(
        mean_squared_error(demo(x_test), dt.predict(x_test.reshape(-1, 1)))
    )
mse_rf_mean = mean_squared_error(demo(x_test), mean_pred)
| | Stump 1 | Stump 2 | Stump 3 | Stump 4 | Mean |
|---|---|---|---|---|---|
| MSE | 0.074 | 0.194 | 0.194 | 0.201 | 0.042 |
The error of each individual stump is considerably higher than that of the aggregated mean prediction, in the worst case almost five times as high. We shall keep this metric in mind as we move on to our next model.
Out of the woods with gradient boosting
Following the algorithm from the last lesson, let's add in the code needed to perform gradient boosting.
- Let's call the first stump $f_0$. This model shall be a straightforward decision tree, trained on the original set $\{(x_i, y_i)\}$.
dt0 = DecisionTreeRegressor(max_depth=1)
dt0.fit(x_train.reshape(-1, 1), y_train)
- Compute the residual error $r_1 = y - f_0(x)$ of this first model:
r1 = y_train - dt0.predict(x_train.reshape(-1, 1))
r1 = [0.00, 0.26, -0.03, -0.23, -0.32,
-0.33, -0.23, -0.03, 0.26, 0.66]
- Now we train the next tree stump on these residuals, so that the training data is $\{(x_i, r_{1,i})\}$. We are now modelling the difference between $f_0(x)$ and the target $y$.
dt1 = DecisionTreeRegressor(max_depth=1)
dt1.fit(x_train.reshape(-1, 1), r1)
- Note that the sum $F_1 = f_0 + f_1$ shall be our ensemble model. And the next residual is formed by the difference $r_2 = r_1 - f_1(x) = y - F_1(x)$.
In code we execute the following
r2 = r1 - dt1.predict(x_train.reshape(-1, 1))
to obtain
r2 = [0.12, 0.38, 0.08, -0.12, -0.21,
-0.21, -0.12, 0.08, -0.20, 0.20]
- Now we can iterate this up to a given number of stumps ($M$, say), where the $m$-th stump $f_m$ is trained by using $x$ to predict the previous residuals $r_m = r_{m-1} - f_{m-1}(x)$, at which stage the ensemble model is evaluated by $F_m = f_0 + f_1 + \cdots + f_m$.
In our example, $M = 4$ (the value of NUM_STUMPS), and we can compare the final model to the output of the random forest. A sketch collecting the above steps into a single loop is given below.
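The following sketch is our own consolidation of the steps above (variable names such as boosters and boosted_preds are chosen here for illustration, not taken from the lesson). Each stump is fitted to the current residuals on the full training set, and the ensemble prediction on the test points is accumulated stage by stage. Note that, as in the worked example, no learning rate (shrinkage) is applied.
# A minimal sketch of the boosting loop described above
boosters = list()                        # The stumps f_0, f_1, ..., f_{M-1}
residual = y_train.copy()                # Start by fitting the targets themselves
boosted_train = np.zeros_like(y_train)   # Running ensemble prediction on x_train
boosted_test = np.zeros(len(x_test))     # Running ensemble prediction on x_test
boosted_preds = list()                   # Test predictions after each stage

for m in range(NUM_STUMPS):
    # Fit the m-th stump to the current residuals on the full training set
    f_m = DecisionTreeRegressor(max_depth=1)
    f_m.fit(x_train.reshape(-1, 1), residual)
    boosters.append(f_m)

    # Update the ensemble predictions and recompute the residuals
    boosted_train = boosted_train + f_m.predict(x_train.reshape(-1, 1))
    boosted_test = boosted_test + f_m.predict(x_test.reshape(-1, 1))
    residual = y_train - boosted_train

    # Keep a snapshot of the test prediction after each stage
    boosted_preds.append(boosted_test.copy())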
Figure 2. Applying Gradient Boosting to the stumps immediately shows far better convergence with the small number of parameters available.
By eye, the performance already looks much better (see the fourth solid line). To be sure, let's assess it numerically.
| | 1 Stump | 2 Stumps | 3 Stumps | 4 Stumps |
|---|---|---|---|---|
| MSE | 0.074 | 0.039 | 0.022 | 0.019 |
As expected, in contrast to the independent stumps of the random forest, the MSE decreases as each stump is added. With 4 stumps, the reported MSE of 0.019 is less than half the MSE of 0.042 for the mean prediction of the forest.
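For completeness, these per-stage errors could be computed as follows, assuming the boosted_preds list from the sketch above.
# One MSE value per ensemble size (1, 2, 3 and 4 stumps)
mse_gb = [
    mean_squared_error(demo(x_test), pred) for pred in boosted_preds
]
print(mse_gb)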
Summary
There is great potential for speeding up our first-principles implementation. The good news is that the XGBoost package provides just that, and a great deal more. Now that we have solidified the ideas behind the algorithm, let's press on with deploying XGBoost in the wild.