by Dr Andy Corbett


Gradient Boosting

13. XGBoost in the Wild

📂 Resources

Download the resources for this lesson here.

XGBoost: you may have heard of this popular algorithm, or you may not have. But every professional implementing ML in industry will be very familiar with this package due to its speed, accuracy and sheer predictive power. The algorithm comes pre-baked and tuned for deployment in industry ML stacks. So let's have a go ourselves.

📑 Learning Objectives
  • Unpack the XGBoost package in Python.
  • Deploy it on a real-world problem that we have seen before.
  • Visualise the learnt algorithm and present results.

A good place to start in our quest to unwrap XGBoost is with a simple import.

import xgboost as xgb

XGBoost is more advanced still than the process we described above. It additionally uses second-order gradients (gradients of gradients) to implement a 'Newton method' optimisation, rather than the first-order gradient-descent approach. This means the residuals are tuned faster, and fewer weak learners--the stumps--are needed for the ensemble. We refer to the inclusion of each successive weak learner as a boosting round.
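To make the 'Newton method' point concrete, here is a sketch of the objective XGBoost minimises at each boosting round t, following the notation of the original XGBoost paper: the loss is expanded to second order around the current prediction, so the new learner sees both a gradient term and a curvature term,

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[\, g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \,\Big] + \Omega(f_t),$$

where $f_t$ is the weak learner added at round $t$, $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the previous round's prediction for sample $i$, and $\Omega$ is a regularisation term penalising tree complexity.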

Step 1: Pre-process the problem


Let's consider the concrete compressive-strength prediction problem that we previously showed to the random forest. Here we were predicting the compressive strength as a function of 8 independent variables.

Input variables: Cement; Blast Furnace Slag; Fly Ash; Water; Superplasticizer; Coarse Aggregate; Fine Aggregate; Age.

Output target: Compressive strength.

We begin by importing the dataset and dividing it into training and testing sets.

import pandas as pd
from sklearn.model_selection import train_test_split

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls'

# Read the excel file (requires an install of the `xlrd` package)
df = pd.read_excel(url)

target_name = df.columns[-1]
input_names = df.columns[:-1]

# Check the column names
print('Target: ', target_name)
print('Features: ', input_names)

# Split into independent and dependent variables
X = df.drop(target_name, axis=1).to_numpy()
y = df[target_name].to_numpy().reshape(-1, 1)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=31,
)

This is the format we use for sklearn models. XGBoost conveniently gives us an sklearn API for seamless integration. XGBoost also offers its own API with additional functionality. We shall explore both in this tutorial.

For the XGBoost API we use a bespoke data loader.

dtrain = xgb.DMatrix(data=X_train, label=y_train)
dtest = xgb.DMatrix(data=X_test, label=y_test)

This DMatrix class can take in various data formats, such as numpy arrays and pandas data frames. Saving the data into a binary buffer file will improve data loading speed.

dtrain.save_binary('train.buffer')

Which can then be reloaded again via

dtrain = xgb.DMatrix('train.buffer')
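As an aside, a DMatrix does not have to be built from numpy arrays; it will also accept a pandas DataFrame directly, picking up the feature names from the columns. A minimal sketch using the df loaded earlier (the name dall is just illustrative):

# Build a DMatrix straight from the pandas DataFrame (whole dataset, for illustration)
dall = xgb.DMatrix(data=df[input_names], label=df[target_name])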

Ultimately, using the XGBoost API offers some speed-up in training and a great deal of speed-up in prediction, as well as some additional functionality. We shall give a flavour of both APIs here.
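To give an early flavour of the native interface: training goes through xgb.train, with hyper-parameters passed as a dictionary and predictions made on a DMatrix. The sketch below uses illustrative parameter values and variable names (the individual parameters are discussed in Step 2):

# Hyper-parameters are passed to the native API as a dictionary
params = {
    'objective': 'reg:squarederror',
    'max_depth': 3,
    'learning_rate': 0.1,
    'colsample_bytree': 0.7,
}

# Train for 100 boosting rounds on the DMatrix built above
booster = xgb.train(params, dtrain, num_boost_round=100)

# Predict on the test DMatrix (not the raw numpy array)
preds_native = booster.predict(dtest)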

Step 2: Run a basic model


Let us build our first XGBoost model using the sklearn API. As ours is a regression problem, we create the following object.

xgbr = xgb.XGBRegressor(
    objective='reg:squarederror',
    n_estimators=100,
    max_depth=3,  # Also available: `max_leaves`
    learning_rate=0.1,
    colsample_bytree=0.7,
)

We have included a few optional arguments here to discuss.

  • objective: This argument corresponds to the model selection in the XGBoost API. The default for the XGBRegressor is 'reg:squarederror'; this tells the API that we have a regression problem and that our penalty is the mean-squared-error function. It does not need to be specified explicitly in the sklearn API.
  • n_estimators: This is the (max) number of stumps to include in our model.
  • max_depth: This controls the largest depth each stump can grow to. Recall that we want weak learners--stumps!--so this parameter is usually very low.
  • learning_rate: This is a multiplier to reduce the amount each successive stump (estimating the gradient/residual) contributes to the overall prediction.
  • colsample_bytree: This argument tells the model to randomly choose a proportion of the features when training each stump. The right balance here helps to prevent overfitting; however, too low a value could lead to underfitting.

Just as in the sklearn interface, we can fit our numpy data and run analysis on the predictions from the test data.

import numpy as np
from sklearn.metrics import mean_squared_error

xgbr.fit(X_train, y_train)
preds = xgbr.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))

print(f"RMSE: {rmse}")

The RMSE value here is intuitive, as it describes the typical size of our prediction error in the units of the target variable, compressive strength (MPa):

RMSE: 5.14
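For reference, the metric printed above is the square root of the mean squared residual over the test set,

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \big( y_i - \hat{y}_i \big)^2},$$

where $y_i$ are the true strengths and $\hat{y}_i$ the model's predictions.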

Alone, this value doesn't tell us a great deal; the reported RMSE is most useful for comparing different modelling paradigms. We shall use the XGBoost API to further inspect our model's performance through cross-validation.

Before that, let's look through some other useful tools attached to our model. First of all, we can save our trained model with xgbr.save_model('model.json'); loading the corresponding file is achieved with xgbr.load_model('model.json').
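In code (xgbr2 is just an illustrative name for a fresh estimator to load into):

# Persist the trained model to disk
xgbr.save_model('model.json')

# Load it back into a new, untrained estimator
xgbr2 = xgb.XGBRegressor()
xgbr2.load_model('model.json')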

We can determine the $R^2$-value (the coefficient of determination of the predictions against the ground truth) using xgbr.score(X_test, y_test), which for this model returns

0.90

Note that, straight out of the box, the XGBoost model matches the performance of our Random Forest predictor, which also scored 0.90. We hope to improve upon this with careful hyper-parameter selection.
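One way to guide that search is the cross-validation utility in the native API, xgb.cv, which replaces a single train/test split with k-fold estimates of the error at each boosting round. A minimal sketch using the dtrain DMatrix built earlier (parameter values and variable names are illustrative):

# Illustrative parameter setting to cross-validate
params = {'objective': 'reg:squarederror', 'max_depth': 3,
          'learning_rate': 0.1, 'colsample_bytree': 0.7}

# 5-fold cross-validation over 100 boosting rounds
cv_results = xgb.cv(params, dtrain, num_boost_round=100,
                    nfold=5, metrics='rmse', seed=31, as_pandas=True)

# Mean test RMSE at the final boosting round
print(cv_results['test-rmse-mean'].iloc[-1])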

Visualise the stumps


XGBoost has a utility for plotting the tree decisions. That's given by the following function.

xgb.plot_tree(xgbr, num_trees=0)

Figure 1. The decision graph for the first tree in the model.

This figure displays a graph of the first weak learner (setting num_trees to zero). The first learner is the most meaningful to assess, as it attempts to model the solution directly, whereas each subsequent tree learns the error from the previous one.
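The rendered graph can be hard to read at the default figure size. One option, sketched below, is to draw the tree onto an explicit matplotlib axis and save it at high resolution (xgb.plot_tree relies on the graphviz package being installed):

import matplotlib.pyplot as plt

# Draw the first tree onto a large axis and write it to disk
fig, ax = plt.subplots(figsize=(30, 15))
xgb.plot_tree(xgbr, num_trees=0, ax=ax)
fig.savefig('first_tree.png', dpi=200)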