by Dr Andy Corbett

Gradient Boosting

11. Auto-Correction in a Forest of Stumps

Gradient Boosting In Forests

Our journey has taken us to the most powerful out-of-the-box machine learning algorithm: gradient-boosted ensemble models. We've already seen one ensemble model: the random forest, an ensemble of decision trees. However, the next algorithm under review, gradient-boosted trees, consistently outperforms random forests on machine-learning league tables. It is always worth pulling out of your back pocket and applying first to a new data problem. Why? Its performance and ease of use are simply unparalleled. In this tutorial we'll begin by gently unpacking the theory behind the algorithm.

📑 Learning Objectives
  • Introduce the notion of Gradient Boosted Trees.
  • Explain terminology: 'weak learners', 'stumps', 'residuals', 'gradients'.
  • Write down the mathematical trick that drives the algorithm.
  • Walk through the algorithm in its most basic form.
Shining a Light on the Stumps

From trees to stumps with auto-correction


Our journey has brought us from trees to forests and out the other side, into a field full of stumps. When building a random forest, we decided that the trees should be trained independently, and thus could also be trained simultaneously. The final answer was then formed by averaging over all the outputs in the forest. This method required us to grow trees in proportion to the size (resp. square root of the size) of the data set for regression (resp. classification) problems.

Gradient-boosted trees are trained in sequence, each dependent on the previous tree: each tree learns the error in the previous prediction (the residual, in regression problems), rather than recreating the data itself. This means that if a data point is mispredicted by the model, the next tree will divert attention there. At the same time, all correct predictions have a small error and are grouped together around zero.

This is so effective that we are able to reduce each tree to a stump: a decision tree with a single split. These small intermediate models, which learn residuals, are formally known as 'weak learners'. Furthermore, we aren't tied to using just tree-based models: the core concept works with any model.

Walking through the algorithm with a visual example


Let's define the algorithm by example: a simple quadratic curve, $y(x) = x^2$. For comparison, let's first construct a random forest of stumps to see how it fares.

Our training data will consist of 10 points on this curve. Our goal is to interpolate between these points with 100 test points.

The training target is then the following list of numbers:

$$\mathbf{y} = [1.00, 0.60, 0.31, 0.11, 0.01, 0.01, 0.11, 0.31, 0.60, 1.00]$$
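
Here is a minimal sketch of this set-up in Python, assuming the 10 training inputs are evenly spaced over $[-1, 1]$ (an assumption, but one that reproduces the target values above):

```python
import numpy as np

# Ten training points on the curve y = x^2, evenly spaced over [-1, 1].
x_train = np.linspace(-1, 1, 10).reshape(-1, 1)
y_train = (x_train ** 2).ravel()

# One hundred test points for interpolating between the training points.
x_test = np.linspace(-1, 1, 100).reshape(-1, 1)

print(np.round(y_train, 2))  # matches the target vector y above (to 2 d.p.)
```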

A random forest of stumps

First, let's construct our baseline model for comparison: a random forest of stumps. We'll train each stump in the forest on bagged data (recall this definition); in contrast, the gradient-boosted trees will be trained on the entire data set.

This leaves us with a list of models, the 'stumps', each of which makes its predictions via a single split. To form the aggregate model (the random forest), we take the mean of the predictions, since this is a regression problem.
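
Continuing the sketch above, one way to build such a forest with scikit-learn is to cap the tree depth at one; the specific bagging settings of the original experiment are an assumption here:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Four depth-1 trees ('stumps'), each trained on a bagged (bootstrap) sample.
forest = RandomForestRegressor(
    n_estimators=4,
    max_depth=1,      # a single split per tree
    bootstrap=True,   # bagging: each stump sees a resampled copy of the data
    random_state=0,
)
forest.fit(x_train, y_train)

# The forest prediction is the mean of the individual stump predictions.
stump_preds = np.array([stump.predict(x_test) for stump in forest.estimators_])
forest_pred = stump_preds.mean(axis=0)
```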


Figure 1. Using a collection of shallow trees, 'stumps', to form a random forest does not provide enough flexibility to solve even simple problems.

This is not an impressive model. The locations of the splits impact the model drastically: the shape of the error function forces the splits toward the sides, leaving a void in the centre. Averaging a collection of stumps is simply too coarse for this data set.

Let's check the Mean-Squared Error (MSE) for each stump, as well as the forest.

        Stump 1   Stump 2   Stump 3   Stump 4   Mean
MSE     0.074     0.194     0.194     0.201     0.042

The error of each individual stump is almost five times as high as that of the aggregate forest, and is roughly uniform across the stumps. We shall keep this metric in mind as we move to our next model.
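
Continuing from the snippets above, a table like this can be produced as follows (the exact numbers depend on the random bagging, so they will differ from run to run):

```python
from sklearn.metrics import mean_squared_error

# True targets on the test grid, for scoring the interpolation.
y_test = (x_test ** 2).ravel()

# MSE of each individual stump...
for i, pred in enumerate(stump_preds, start=1):
    print(f"Stump {i}: MSE = {mean_squared_error(y_test, pred):.3f}")

# ...and of the aggregate forest (the mean of the stump predictions).
print(f"Forest mean: MSE = {mean_squared_error(y_test, forest_pred):.3f}")
```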

Out of the woods with gradient boosting

Now let's walk through the gradient-boosting algorithm, in which each stump depends on the previous one, and hence on all of its predecessors. We denote our data set by $(X, \mathbf{y})$, where we are trying to predict the vector $\mathbf{y}$ from the columns of $X$.

  1. Let's call the first stump $T_0$. This model is a straightforward decision tree, trained on the original set $(X, \mathbf{y})$.

  2. Compute the residual error $R_1 = \mathbf{y} - T_0$:

$$R_1 = [0.00, 0.26, -0.03, -0.23, -0.32, -0.33, -0.23, -0.03, 0.26, 0.66]$$

💭 The MSE for $T_0$ is the mean of the squares of the numbers in $R_1$. Keep this in mind for later.

  3. Now we train the next tree stump, $T_1$, on these residuals, so that the training data is $(X, R_1)$. We are now modelling the difference between $T_0$ and the target $\mathbf{y}$.

  4. Note that $T_0 + T_1 \approx T_0 + \mathbf{y} - T_0 = \mathbf{y}$. This sum shall be our ensemble model. The next residual is formed by the difference $R_2 = \mathbf{y} - (T_0 + T_1) = R_1 - T_1$, so that

$$R_2 = [0.12, 0.38, 0.08, -0.12, -0.21, -0.21, -0.12, 0.08, -0.20, 0.20]$$
  5. Now we can iterate this up to a given number of stumps ($N = 100$, say), where the $i$-th stump $T_i$ is trained by using $X$ to predict the previous residuals $R_i = R_{i-1} - T_{i-1}$, at which stage the model is evaluated as $\sum_{i=0}^{N} T_i$. A minimal sketch of this loop is given below.
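
Continuing from the data set-up above, the loop might look as follows, with scikit-learn's DecisionTreeRegressor standing in for each stump $T_i$ (a sketch, not the exact code behind the figures):

```python
from sklearn.tree import DecisionTreeRegressor

n_stumps = 4                  # N = 4 in our example
stumps = []
residual = y_train.copy()     # start from the raw targets, so T_0 fits y itself

for i in range(n_stumps):
    # Train stump T_i on (X, R_i): it learns the current residual.
    stump = DecisionTreeRegressor(max_depth=1)
    stump.fit(x_train, residual)
    stumps.append(stump)

    # Update the residual: R_{i+1} = R_i - T_i.
    residual = residual - stump.predict(x_train)

# The ensemble prediction is the sum of all stump predictions.
boosted_pred = sum(stump.predict(x_test) for stump in stumps)
```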

In our example, $N = 4$, and we can compare the final model to the output of the random forest.


Figure 2. Applying Gradient Boosting to the stumps immediately shows far better convergence with the small number of parameters available.

By eye, the performance appears far better (see the fourth solid line). To be sure, let's assess the performance numerically.

        1 Stump   2 Stumps   3 Stumps   4 Stumps
MSE     0.074     0.039      0.022      0.019

As expected, in contrast to the independent random-forest stumps, the MSE decreases with each added stump. And with 4 stumps, the reported MSE of 0.019 is less than half the MSE of 0.042 for the mean prediction of the forest.
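
For reference, scikit-learn packages this procedure as GradientBoostingRegressor. A rough sketch of an equivalent configuration is below; note that the library initialises the ensemble with the mean prediction rather than a first stump fitted directly to $\mathbf{y}$, so its numbers will differ slightly from the walk-through above:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

gbr = GradientBoostingRegressor(
    n_estimators=4,     # four stumps, as in the example
    max_depth=1,        # each weak learner is a single split
    learning_rate=1.0,  # no shrinkage, matching the bare algorithm above
)
gbr.fit(x_train, y_train)

y_test = (x_test ** 2).ravel()
print(f"MSE: {mean_squared_error(y_test, gbr.predict(x_test)):.3f}")
```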

More general gradients

This construction is theoretically delightful. If we were to impose a mean-squared loss on our model,

$$L(T_n) = \frac{1}{2N}\sum_{i=1}^{N}\|T_n - R_n\|^2,$$

then, with a gradient-boosted model $F_n = \sum_{i=1}^{n} T_i$, the negative gradient of this loss is (up to a constant factor) precisely the next residual:

$$-\frac{\partial L(T_n)}{\partial T_n} \propto R_n - T_n = R_{n+1},$$

and so, if we pick a small real number $\alpha$, a learning rate, and perform the model update $F_n \mapsto F_{n+1} = F_n + \alpha T_{n+1}$, scaled by the learning rate, we recover the update formula used in gradient descent. So this algorithm fits into a common optimisation framework, and we can also justify using more general loss functions $L$ on bespoke problems.
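
As a sketch of this scaled update, here is a hand-rolled variant of the earlier loop; the learning rate $\alpha = 0.3$ and the number of stumps are chosen purely for illustration:

```python
from sklearn.tree import DecisionTreeRegressor

alpha = 0.3                   # learning rate (illustrative value)
n_stumps = 100
stumps = []
residual = y_train.copy()

for i in range(n_stumps):
    # Fit the next weak learner to the current (pseudo-)residuals.
    stump = DecisionTreeRegressor(max_depth=1)
    stump.fit(x_train, residual)
    stumps.append(stump)

    # F_{n+1} = F_n + alpha * T_{n+1}: only the scaled contribution
    # is subtracted from the residual.
    residual = residual - alpha * stump.predict(x_train)

# Ensemble prediction with learning-rate scaling.
boosted_pred = sum(alpha * stump.predict(x_test) for stump in stumps)
```

For a more general loss $L$, the stump at each step would instead be fitted to the negative gradient of $L$ with respect to the current prediction, the so-called pseudo-residual; for the squared loss this is exactly the ordinary residual used above.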