by Dr Andy Corbett

7. Random Forests: Decisions Don't Fall Far from the Decision Tree

Random forests are powerful and popular machine learning tools. They generalise the notion of a decision tree via a statistical approach known as ensemble modelling. In this lesson we explore the structure and key concepts underpinning a random forest model.

📑 Learning Objectives
  • Fix the over-fitting problem encountered with decision trees.
  • Understand model democracy via bootstrap aggregation, or "bagging".
  • Weigh the advantages and disadvantages of random forests.

Democracy in a Forest

Decisions never fall far from the decision tree


Over-fitting: it's one of the central problems in machine learning. How do we measure it? A tell-tale sign of over-fitting is when your training accuracy is sky high (say 99.9%) but the testing accuracy is much worse (well below 80%). Decision trees are prone to over-fitting because at each node the tree aims to split the data into different bins. With enough of these bins, the tree will almost certainly learn to describe each training point perfectly (Fig. 1).

Overfit Tree

Figure. 1. An over-fit decision tree modelling noisy data containing 50 training samples.

How could we fix this issue? One option is simply to reduce the degrees of freedom in the model by limiting the number of leaves. This is an important parameter to tune; a good rule of thumb is to constrain the number of leaf nodes to roughly the square root of the number of training data points for classification, or to the full number of training points for regression.
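To make this concrete, here is a minimal sketch using scikit-learn; the synthetic dataset and its parameters are illustrative assumptions, not part of the lesson. It compares an unconstrained decision tree with one whose leaf count is capped at roughly the square root of the number of training samples, and prints the train/test accuracy gap described above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy classification data (an assumption for this sketch).
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unconstrained tree: free to grow one leaf per training point.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constrained tree: cap the leaf count at ~sqrt(number of training samples).
n_leaves = int(np.sqrt(len(X_train)))
shallow = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0).fit(X_train, y_train)

for name, model in [("unconstrained", deep), (f"max_leaf_nodes={n_leaves}", shallow)]:
    print(f"{name:>20}  train: {model.score(X_train, y_train):.3f}"
          f"  test: {model.score(X_test, y_test):.3f}")
```

With the unconstrained tree you should see a near-perfect training score alongside a noticeably weaker test score, while the constrained tree trades a little training accuracy for a smaller gap.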

But our estimates are still rigidly tied to the training dataset. For example, in Fig. 1 we can see two extremes:

  • With 50 leaf nodes (as many as there are data samples), the model splits each input point into its own region.
  • With just three leaf nodes, the data is averaged in three separate regions.

In either case, the usefulness of the model is questionable when assessing out-of-sample test data points.

Constructing a random forest


A random forest smooths out these rigid predictions by combining the outputs of many trees. This is achieved via a statistical technique called bootstrap aggregation, a.k.a. 'bagging'.

Bootstrapping: Each tree in the forest is trained on a subset of the original training set. Each subset is drawn at random and with replacement, meaning a bootstrapped subset chooses its samples at random, possibly with duplication. The significance of this is that the draws within each subset are independent of one another.

Out-of-bag samples: For each subset, there will be a number of samples from the original set that are not selected. These are kept aside and can be used during training to find the optimal number of trees to use.

Aggregation: Once the separate trees are trained (this may be completed in parallel), the final prediction is produced by taking the mean value (for regression) or the majority vote (for classification).
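The three ingredients above can be spelled out in a few lines of code. Below is a hand-rolled bagging sketch for regression, built from scikit-learn decision trees; the noisy 1-D sine dataset and the choice of 100 trees are illustrative assumptions rather than part of the lesson.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Illustrative noisy 1-D regression data (50 samples, echoing Fig. 1).
X = np.sort(rng.uniform(0, 5, size=(50, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=50)

n_estimators, n_samples = 100, len(X)
trees = []
for _ in range(n_estimators):
    # Bootstrapping: draw n_samples indices at random, with replacement.
    idx = rng.integers(0, n_samples, size=n_samples)
    # Out-of-bag samples: the points never drawn for this tree; they could
    # be kept aside to validate the tree (not used further in this sketch).
    oob = np.setdiff1d(np.arange(n_samples), idx)
    trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# Aggregation: the forest's prediction is the mean of the tree predictions.
X_new = np.linspace(0, 5, 200).reshape(-1, 1)
y_pred = np.mean([tree.predict(X_new) for tree in trees], axis=0)
```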

All together, the following steps describe a random forest algorithm:

  1. Choose your number of trees, n_estimators. Defaulting to 100 is a good bet.
  2. Choose your maximum number of samples, max_samples, to be drawn for each subset.
  3. For t = 1, ..., n_estimators, draw max_samples samples with replacement, forming a bootstrapped subset. The number of unique samples in each bootstrapped subset may be less than max_samples due to repetition.
  4. Train n_estimators decision trees.
  5. Aggregate the output of the trees to form a prediction.
Bagging

Figure. 2. Diagram of a Random Forest algorithm, bagging the data amongst three trees.
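The same recipe is packaged in scikit-learn's RandomForestClassifier (with RandomForestRegressor as the regression counterpart). A minimal usage sketch follows, with a placeholder synthetic dataset; the comments map the arguments back to the numbered steps above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder synthetic dataset (an assumption for this sketch).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,   # step 1: number of trees
    max_samples=0.8,    # step 2: bootstrap size per tree (here, 80% of the training set)
    bootstrap=True,     # step 3: draw each subset with replacement
    oob_score=True,     # score the forest on its out-of-bag samples
    random_state=0,
)
forest.fit(X_train, y_train)  # step 4: train the trees

# Step 5: predict() aggregates the trees' outputs by majority vote.
print("out-of-bag score:", round(forest.oob_score_, 3))
print("test accuracy:   ", round(forest.score(X_test, y_test), 3))
```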

Pros and cons of a random forest model


  • A key advantage of random forests is that they may be computed in parallel: each tree is independent of the others, so training for all trees may be conducted simultaneously (see the sketch below). Another is that they are typically robust to over-fitting, meaning they perform better on out-of-sample data.

  • A disadvantage is that with many trees, and hence many parameters, they can perform slowly at inference time compared with other techniques. Nevertheless, this cost scales with data size and complexity, so on smaller datasets they are good performers; and on such datasets their robustness to over-fitting is crucial.
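As a brief illustration of the parallelism point, scikit-learn exposes it through the n_jobs argument, which spreads tree fitting (and prediction) across CPU cores; the dataset below is again a placeholder assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder dataset, just to have something to fit.
X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

# n_jobs=-1 trains (and later predicts with) the trees in parallel on all
# available cores; this is safe because each tree is independent of the rest.
forest = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
forest.fit(X, y)
```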