by Dr Andy Corbett
Gradient Boosting
12. Gradient Boosting by Hand: Code Example
Gradient Boosting In Forests
In this lesson we take to our computers and implement a worked example of the gradient boosting algorithm. For what purpose? To dress up the theory we just learnt and see first-hand the impact of learning residuals from previous estimates.
Following the example of the last lecture, we implement the algorithm in code to get a feel for the Gradient Boosting process before pulling out the heavy machinery of XGBoost.
- Revisit the algorithm described in theory in the last lecture.
- Implement the algorithm from scratch on a worked example.
- Compare with the implementation of a random forest.
Coding a forest of tree stumps
Revisiting the example from the last lesson, we set about solving a supervised learning problem using Gradient Boosted Trees and comparing the algorithm with that of the Random Forest.
The data, as before, is given by a simple quadratic curve, $y = x^2$. Our training data shall consist of 10 points on this curve, and our goal is to interpolate between these points with 100 test points. Let's set up the example.
import numpy as np
NUM_STUMPS = 4 # Number of stumps in each model
np.random.seed(31) # Fix seed to reproduce example
def demo(x):
    """Our ground-truth function."""
    return x ** 2
# Training data
x_train = np.linspace(-1, 1, 10)
y_train = demo(x_train)
# Testing points
x_test = np.linspace(-1, 1, 100)
The training target is then the following list of numbers.
y_train = [1.00, 0.60, 0.31, 0.11, 0.01, 0.01, 0.11, 0.31, 0.60, 1.00]
In fact, for scikit-learn models, we should structure one-dimensional data as
x_train = np.linspace(-1, 1, 10).reshape(-1, 1)
so that it has shape (num_samples, num_features), since scikit-learn's fit and predict methods expect a two-dimensional feature array. This is why we call .reshape(-1, 1) before fitting and predicting in the code below.
A random forest of stumps
First, let's construct our baseline model for comparison: a random forest of stumps. We'll train each stump in the forest on bagged data (recall this definition); in contrast, the gradient-boosted stumps will be trained on the entire data set.
from sklearn.tree import DecisionTreeRegressor

BAG_NUM = int(0.6 * len(x_train))  # We'll bag 60% of the training set

# List to store stump models
stumps = list()

for s in range(NUM_STUMPS):
    x_sample = np.random.choice(x_train, BAG_NUM).reshape(-1, 1)
    # For a stump, we only want to split once
    dt = DecisionTreeRegressor(max_depth=1)
    dt.fit(x_sample, demo(x_sample))
    # Append to model list
    stumps.append(dt)
This leaves us with a list of models in stumps, each of which makes predictions via a single split. To form the aggregate model (the random forest), since this is a regression problem, we take the mean of the predictions.
# Average the stump predictions point-wise to form the forest prediction
mean_pred = np.mean([dt.predict(x_test.reshape(-1, 1)) for dt in stumps], axis=0)
To reiterate, each dt in stumps is a decision tree with a single decision node. That means each one splits the data set into two and assigns a value to each rectangular region by averaging the training points that fall there. Let's see what that looks like.
Figure 1. Using a collection of shallow trees, 'stumps', to form a random forest does not give enough flexibility to solve even simple problems.
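Before computing errors, it can help to peek inside one of these fitted stumps. The following snippet is an illustrative addition (not part of the original example) that uses scikit-learn's tree_ attribute to print the first stump's split threshold and its two leaf values.
# Illustrative: inspect the first stump's single split and its two leaf values
dt = stumps[0]
root = 0
left = dt.tree_.children_left[root]
right = dt.tree_.children_right[root]
print("Split at x <=", dt.tree_.threshold[root])
print("Left leaf predicts:", dt.tree_.value[left][0][0])
print("Right leaf predicts:", dt.tree_.value[right][0][0])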
Let's check the Mean-Squared Error (MSE) for each stump, as well as the forest.
from sklearn.metrics import mean_squared_error

mse_rf_stumps = list()
for dt in stumps:
    mse_rf_stumps.append(
        mean_squared_error(demo(x_test), dt.predict(x_test.reshape(-1, 1)))
    )
mse_rf_mean = mean_squared_error(demo(x_test), mean_pred)
| | Stump 1 | Stump 2 | Stump 3 | Stump 4 | Mean |
|---|---|---|---|---|---|
| MSE | 0.074 | 0.194 | 0.194 | 0.201 | 0.042 |
The error of each individual stump is considerably higher than that of the aggregated mean prediction, in the worst case almost five times as high. We shall keep this metric in mind as we move on to our next model.
Out of the woods with gradient boosting
Following the algorithm from the last lesson, let's add in the code needed to perform gradient boosting.
- Let's call the first stump $f_0$. This model shall be a straightforward decision tree, trained on the original set $\{(x_i, y_i)\}$.
dt0 = DecisionTreeRegressor(max_depth=1)
dt0.fit(x_train.reshape(-1, 1), y_train)
- Compute the residual error $r_1 = y - f_0(x)$ of this first model:
r1 = y_train - dt0.predict(x_train.reshape(-1, 1))
r1 = [0.00, 0.26, -0.03, -0.23, -0.32,
-0.33, -0.23, -0.03, 0.26, 0.66]
- Now we train the next tree stump on these residuals, so that the training data is $\{(x_i, r_{1,i})\}$. We are now modelling the difference between $f_0(x)$ and the target $y$.
dt1 = DecisionTreeRegressor(max_depth=1)
dt1.fit(x_train.reshape(-1, 1), r1)
- Note that the sum $F_1 = f_0 + f_1$ shall be our ensemble model. And the next residual is formed by the difference $r_2 = r_1 - f_1(x) = y - F_1(x)$.
In code we execute the following
r2 = r1 - dt1.predict(x_train.reshape(-1, 1))
to obtain
r2 = [0.12, 0.38, 0.08, -0.12, -0.21,
-0.21, -0.12, 0.08, -0.20, 0.20]
- Now we can iterate this up to a given number of stumps ($M$, say), where the $m$-th stump $f_m$ is trained by using $x$ to predict the previous residuals $r_m = r_{m-1} - f_{m-1}(x)$, at which stage the ensemble model is evaluated by $F_m = f_0 + f_1 + \cdots + f_m$.
In our example, $M = 4$ (the value of NUM_STUMPS), and we can compare the final model to the output of the random forest. A sketch collecting the above steps into a single loop is given below.
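The following sketch is our own consolidation of the steps above (variable names such as boosters and boosted_preds are chosen here for illustration, not taken from the lesson). Each stump is fitted to the current residuals on the full training set, and the ensemble prediction on the test points is accumulated stage by stage. Note that, as in the worked example, no learning rate (shrinkage) is applied.
# A minimal sketch of the boosting loop described above
boosters = list()                        # The stumps f_0, f_1, ..., f_{M-1}
residual = y_train.copy()                # Start by fitting the targets themselves
boosted_train = np.zeros_like(y_train)   # Running ensemble prediction on x_train
boosted_test = np.zeros(len(x_test))     # Running ensemble prediction on x_test
boosted_preds = list()                   # Test predictions after each stage

for m in range(NUM_STUMPS):
    # Fit the m-th stump to the current residuals on the full training set
    f_m = DecisionTreeRegressor(max_depth=1)
    f_m.fit(x_train.reshape(-1, 1), residual)
    boosters.append(f_m)

    # Update the ensemble predictions and recompute the residuals
    boosted_train = boosted_train + f_m.predict(x_train.reshape(-1, 1))
    boosted_test = boosted_test + f_m.predict(x_test.reshape(-1, 1))
    residual = y_train - boosted_train

    # Keep a snapshot of the test prediction after each stage
    boosted_preds.append(boosted_test.copy())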
Figure 2. Applying Gradient Boosting to the stumps immediately shows far better convergence with the small number of parameters available.
By eye, the performance already looks much better (see the fourth solid line). To be sure, let's assess it numerically.
| | 1 Stump | 2 Stumps | 3 Stumps | 4 Stumps |
|---|---|---|---|---|
| MSE | 0.074 | 0.039 | 0.022 | 0.019 |
As expected, in contrast to the independent stumps of the random forest, the MSE decreases as each stump is added. With 4 stumps, the reported MSE of 0.019 is less than half the MSE of 0.042 for the mean prediction of the forest.
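For completeness, these per-stage errors could be computed as follows, assuming the boosted_preds list from the sketch above.
# One MSE value per ensemble size (1, 2, 3 and 4 stumps)
mse_gb = [
    mean_squared_error(demo(x_test), pred) for pred in boosted_preds
]
print(mse_gb)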
Summary
There is great potential for speeding up our first-principles implementation. The good news is that the XGBoost package provides just that, and a great deal more. Now that we have solidified the ideas behind the algorithm, let's press on with deploying XGBoost in the wild.