by Dr Andy Corbett
Random Forests
8. Random Forests out in the Wild
Download the resources for this lesson here.
In this lesson we deploy our new-found knowledge about random forests on a real-world problem: predicting the compressive strength of concrete from measurable factors. We compare performance with a decision tree and a linear regressor. And we conclude by assessing the choice of hyper-parameters attached to the forest, evaluating different choices on 'out-of-bag' data.
- Explore and load the 'Concrete Compressive Strength' dataset.
- Test the predictive power of individual columns.
- Train a linear regressor and decision tree, evaluating performance.
- Train a random forest model from scikit-learn.
- Evaluate the hyper-parameter selection on out-of-bag data.
Now let's get down to business. We'll demonstrate an example problem to see how a random forest compares against the lonely decision tree.
Computing concrete compressive strength from physical factors
I'm no expert in concrete manufacture. However, the answer to such engineering problems very often hides in the data. I'll use our new friend, the random forest, to do the heavy lifting.
First, let's get the data.
import pandas as pd

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/' \
      'compressive/Concrete_Data.xls'
df = pd.read_excel(url)  # Requires installation of the `xlrd` package
print('df shape: ', df.shape)
df.info()
Calling `df.info()` prints some basic properties of the columns in our data set.
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
# Column Dtype
--- ------ -----
0 Cement (component 1)(kg in a m^3 mixture) float64
1 Blast Furnace Slag (component 2)(kg in a m^3 mixture) float64
2 Fly Ash (component 3)(kg in a m^3 mixture) float64
3 Water (component 4)(kg in a m^3 mixture) float64
4 Superplasticizer (component 5)(kg in a m^3 mixture) float64
5 Coarse Aggregate (component 6)(kg in a m^3 mixture) float64
6 Fine Aggregate (component 7)(kg in a m^3 mixture) float64
7 Age (day) int64
8 Concrete compressive strength(MPa, megapascals) float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB
Our task is to predict the ninth column, 'Concrete compressive strength', using the previous eight as indicators. This column is measured on the real-number line, which means this is a regression problem (as opposed to classification).
Let's examine the predictive potential between pairs of features (we are most interested in pairs containing the 'Compressive strength' column) using the Predictive Power Score (PPS).
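One way to compute these pairwise scores is with the open-source ppscore package (an assumption on tooling; the lesson's resources may use a different approach). A minimal sketch, pivoting the scores into a matrix and drawing a heatmap with seaborn:
import matplotlib.pyplot as plt
import ppscore as pps
import seaborn as sns

# Pairwise Predictive Power Scores for every (x, y) column combination
pps_matrix = pps.matrix(df)[['x', 'y', 'ppscore']].pivot(
    columns='x', index='y', values='ppscore',
)

# Each cell scores how well the x-axis column predicts the y-axis column
sns.heatmap(pps_matrix, annot=True, fmt='.2f', cmap='Blues')
plt.tight_layout()
plt.show()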
Figure 1. The Predictive Power Score showing one-on-one relationships between variables.
Here we are using the x-axis variable to predict the y-axis variable. Notice that none of the variables are able to predict the age of the concrete--I think we can agree on that without calling an expert. More importantly, none of the individual features are able to predict the 'Compressive strength' with any degree of confidence. The closest single predictor is the 'Age' variable, with a score of 0.24. But this is not statistically significant, as we can observe by plotting 'Age' vs. 'Compressive strength', below.
Figure 2. Visualising the relationship between 'Compressive Strength' and 'Age'.
In summary, if predictions can be made at all, they are more complex than pairwise (or linear) relations between the features. We need something more sophisticated.
How does a decision tree perform?
We begin by splitting the data into train and test sets. The test set is used to plot the graphs below, so that we do not evaluate the model on examples it has already seen.
from sklearn.model_selection import train_test_split

# Features: the eight mixture/age columns; target: the compressive strength column
X, y = df.iloc[:, :-1], df.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=31,
)
Now we can use the scikit-learn library to train a linear regressor and a decision tree regressor. We then make predictions on the test set and plot the results.
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
# Linear model
linear = LinearRegression()
linear.fit(X_train, y_train)
# Decision tree
dtree = DecisionTreeRegressor()
dtree.fit(X_train, y_train)
# Predict with test data
y_dtpred = dtree.predict(X_test)
# Linear prediction
y_linpred = linear.predict(X_test)
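As a rough sketch of how such a comparison plot can be drawn with matplotlib (the lesson's own plotting code may differ), we can scatter each model's predictions against the ground truth and overlay the diagonal:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, y_pred, title in zip(
    axes, [y_linpred, y_dtpred], ['Linear Regressor', 'Decision Tree'],
):
    ax.scatter(y_test, y_pred, alpha=0.5)
    # A perfect predictor would place every point on this diagonal line
    lims = [y_test.min(), y_test.max()]
    ax.plot(lims, lims, color='pink')
    ax.set_xlabel('Ground truth (MPa)')
    ax.set_title(title)
axes[0].set_ylabel('Prediction (MPa)')
plt.show()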
Figure 3. Performance of a Linear Regressor vs. a Decision Tree.
These scatter plots display the predicted outputs associated with each recorded (ground truth) value for compressive strength, so a perfect predictor would have all of its points along the diagonal pink line.
We measure our success with two quantities:
- R² value: the proportion of the variation in the predicted value that is predictable from the ground truth. That is, if R² is close to 1 then the predictions are very close to the actual values. If it is close to or less than zero, then the residual in the prediction is equal to or greater than the variance in the data itself. (1 good, 0 bad.)
- MSE: the 'mean-square error'. This is the mean of the squared errors between the prediction and the ground truth. It is a relative term. However, if you take the square root, it gives a practical measure of error in the units of the observed quantity (in this case 'megapascals' for compressive strength), which is useful for interpretation.
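Both quantities are available from scikit-learn's metrics module; a minimal sketch for the two models above:
from sklearn.metrics import mean_squared_error, r2_score

for name, y_pred in [('Linear', y_linpred), ('Decision tree', y_dtpred)]:
    print(
        f'{name}: R^2 = {r2_score(y_test, y_pred):.3f}, '
        f'MSE = {mean_squared_error(y_test, y_pred):.2f} MPa^2'
    )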
Looking at the graphs, the decision tree performs far better than a linear fit, as expected. However, there are still some large mispredictions (not captured by the R² or MSE scores) in regions with little data.
Now let us try a random forest
Using the scikit-learn library, we import a random forest regressor.
from sklearn.ensemble import RandomForestRegressor
rforest = RandomForestRegressor(
random_state=31,
n_estimators=100,
)
rforest.fit(X_train, y_train)
# Random forest prediction
y_rfpred = rforest.predict(X_test)
Let's compare the results with the decision tree.
Figure 4. Performance of a Decision Tree vs. a Random Forest, an ensemble of decision trees.
Here we see a great improvement in MSE and an incremental improvement in R². The forest has reduced the large errors in the predictions (interpolating more successfully), ultimately increasing the generalisation power of the model.
Selecting optimal hyper-parameters
In order to better understand the behaviour of the model, we can try to vary some of the hyper-parameters. Here, we will use the technique of Out-Of-Bag samples to validate our models during training and measure the effects of parameters. Some of the parameters of interest are:
- We can vary the `max_features` argument of the `RandomForestRegressor` to restrict the size of the random subset of features considered at each split.
- We can increase the number of trees, `n_estimators`, to identify when the algorithm stabilises.
Let's plot a few different choices of these hyper-parameters to see how they fare on this data set.
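One way to produce such a comparison is to refit forests with oob_score=True over a grid of settings and track the out-of-bag error. The sketch below makes some illustrative assumptions (the particular max_features values and tree counts are not taken from the lesson):
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# Illustrative `max_features` settings to compare (assumed values)
feature_settings = {'all features': None, 'sqrt(n_features)': 'sqrt', '2 features': 2}
tree_counts = list(range(20, 201, 20))

for label, max_features in feature_settings.items():
    oob_errors = []
    for n in tree_counts:
        rf = RandomForestRegressor(
            n_estimators=n,
            max_features=max_features,
            oob_score=True,   # score each sample using only trees that did not see it
            random_state=31,
            n_jobs=-1,
        )
        rf.fit(X_train, y_train)
        # `oob_score_` is the R^2 on out-of-bag data; use 1 - R^2 as an error rate
        oob_errors.append(1.0 - rf.oob_score_)
    plt.plot(tree_counts, oob_errors, label=label)

plt.xlabel('Number of trees (n_estimators)')
plt.ylabel('Out-of-bag error rate')
plt.legend()
plt.show()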
Figure 5. Random Forest performance: measuring the Out-Of-Bag error rate against the number of trees (weak learners) in the forest. We evaluate this for varying constraints on the `max_features` parameter.
This supports our rule of thumb that a larger number of features is preferable for regression tasks (classification tasks should take the square root). We also see that the error stabilises at around 100 trees, which is the default in scikit-learn.