by Dr Andy Corbett
Gradient Boosting
13. XGBoost in the Wild
XGBoost: you may have heard of this popular algorithm, or you may not. Either way, most professionals implementing ML in industry are familiar with this package thanks to its speed, accuracy and sheer predictive power. The algorithm comes pre-baked and tuned for deployment in industry ML stacks. So let's have a go ourselves.
- Unpack the XGBoost package in Python.
- Deploy on a real-world problem that we have seen before.
- Visualise the learnt algorithm and present results.
A good place to start in our quest to unwrap XGBoost is with a simple import.
import xgboost as xgb
XGBoost is more advanced still than the process we describe above. It additionally uses second-order gradients (gradients of gradients) to implement a 'Newton method' optimisation, rather than the first-order gradient descent approach. This means the residuals are fitted faster, and fewer weak learners (the stumps) are needed for the ensemble. We refer to the inclusion of successive weak learners as boosting rounds.
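To make the second-order idea concrete, here is a minimal sketch in plain Python. It is not XGBoost's implementation: it assumes a squared-error loss 0.5 * (prediction - target)^2, whose gradient is prediction - target and whose second derivative is 1, and compares a first-order update with a Newton update for the value of a single leaf. The function names, the `reg_lambda` regularisation term and the toy numbers are illustrative assumptions.

```python
def gradients(y_true, y_pred):
    """First and second derivatives of 0.5 * (y_pred - y_true)**2 w.r.t. y_pred."""
    grad = [p - t for t, p in zip(y_true, y_pred)]
    hess = [1.0 for _ in y_true]
    return grad, hess


def gradient_step(grad, step_size):
    """First-order update: move against the mean gradient by a fixed step size."""
    return -step_size * sum(grad) / len(grad)


def newton_step(grad, hess, reg_lambda=0.0):
    """Second-order update: summed gradients divided by summed Hessians."""
    return -sum(grad) / (sum(hess) + reg_lambda)


# Toy targets that fall into one leaf, all currently predicted as 0.
y_true = [3.0, 5.0, 4.0]
y_pred = [0.0, 0.0, 0.0]

grad, hess = gradients(y_true, y_pred)
first_order = gradient_step(grad, step_size=0.1)
second_order = newton_step(grad, hess)

print(f"gradient step: {first_order:.2f}")
print(f"Newton step:   {second_order:.2f}")
```

For squared error the Newton step lands on the mean of the targets in the leaf in a single update, whereas the first-order step only moves a fraction of the way there. A positive `reg_lambda` shrinks the leaf value towards zero.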
Step 1: Pre-process the problem
Let's return to the concrete compressive-strength prediction problem that we tackled with the random forest. There we predicted the compressive strength as a function of 8 independent variables.
Input variables: Cement; Blast Furnace Slag; Fly Ash; Water; Superplasticizer; Coarse Aggregate; Fine Aggregate; Age.
Output target: Compressive strength.
We begin by importing the dataset and dividing it into training and testing sets.
import pandas as pd
from sklearn.model_selection import train_test_split
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls'
# Read excel file (Requires install of `xlrd` package)
df = pd.read_excel(url)
target_name = df.columns[-1]
input_names = df.columns[:-1]
# Check the column names
print('Target: ', target_name)
print('Features: ', input_names)
# Split into independent and dependent variables
X = df.drop(target_name, axis=1).to_numpy()
y = df[target_name].to_numpy().reshape(-1, 1)
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=31,
)
This is the format we use for sklearn models. XGBoost conveniently gives us an sklearn API for seamless integration. XGBoost also offers its own API with additional functionality. We shall explore both in this tutorial.
For the XGBoost API we use a bespoke data loader.
dtrain = xgb.DMatrix(data=X_train, label=y_train)
dtest = xgb.DMatrix(data=X_test, label=y_test)
This DMatrix class can take in various data formats, such as numpy arrays and pandas data frames. Saving the data into a binary buffer file will improve data loading speed.
dtrain.save_binary('train.buffer')
This can then be reloaded via
dtrain = xgb.DMatrix('train.buffer')
Ultimately, using the XGBoost API offers some speed-up in training and a great deal of speed-up in prediction, as well as some additional functionality. We shall give a flavour of both APIs here.
Step 2: Run a basic model
Let us load our first XGBoost model using the sklearn API. As our problem is a regression problem we create the following object.
xgbr = xgb.XGBRegressor(
    objective='reg:squarederror',
    n_estimators=100,
    max_depth=3,  # Also available: `max_leaves`
    learning_rate=0.1,
    colsample_bytree=0.7,
)
We have included a few optional arguments here to discuss.
- objective: This argument corresponds to the one the XGBoost API uses to select the model. The default for XGBRegressor is 'reg:squarederror'; this tells the API that we have a regression problem and that our loss is the squared error. It does not need to be specified in the sklearn API.
- n_estimators: This is the (max) number of stumps to include in our model.
- max_depth: This controls the largest depth each stump can grow to. Recall that we want weak learners (stumps!), so this parameter is usually very low.
- learning_rate: This is a multiplier that reduces the amount each successive stump (estimating the gradient/residual) contributes to the overall prediction.
- colsample_bytree: This argument tells the model to randomly choose a proportion of the features when training each stump. The right balance here helps to prevent overfitting; however, too low a value could lead to underfitting.
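The interplay between n_estimators and learning_rate can be illustrated with a small plain-Python sketch. This is ordinary first-order gradient boosting with depth-1 stumps on a made-up one-dimensional dataset, not XGBoost's algorithm (there is no second-order information, regularisation or column sampling), and the helper names `fit_stump`, `predict_stump`, `boost` and `mse` are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single threshold split on x that minimises squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = sum((r - left_value) ** 2 for r in left)
        error += sum((r - right_value) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def predict_stump(stump, xi):
    """Return the leaf value on the side of the threshold that xi falls."""
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value


def boost(x, y, n_estimators, learning_rate):
    """Fit stumps to successive residuals, shrinking each by learning_rate."""
    base = sum(y) / len(y)  # start from the mean of the targets
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [
            pi + learning_rate * predict_stump(stump, xi)
            for xi, pi in zip(x, preds)
        ]
    return base, stumps, preds


def mse(y, preds):
    """Mean squared error between targets and predictions."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


# Made-up training data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]

for n_estimators, learning_rate in [(5, 0.1), (50, 0.1), (5, 1.0)]:
    _, _, preds = boost(x, y, n_estimators, learning_rate)
    print(f"n_estimators={n_estimators:>2}, learning_rate={learning_rate}: "
          f"training MSE = {mse(y, preds):.4f}")
```

A small learning rate needs more boosting rounds to reach the same training error, which is why the two parameters are usually tuned together. Note that this sketch only reports training error, so it says nothing about overfitting.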
Just as in the sklearn interface, we can fit our numpy data and run analysis on the predictions from the test data.
import numpy as np
from sklearn.metrics import mean_squared_error
xgbr.fit(X_train, y_train)
preds = xgbr.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print(f"RMSE: {rmse}")
The RMSE value here is intuitive, as it describes the typical size of our prediction error, measured against the true values, in the units of the target variable, compressive strength (MPa):
RMSE: 5.14
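To see what this number measures, the RMSE can be computed by hand. The sketch below uses plain Python and four made-up strength values (not taken from the concrete dataset); the helper name `rmse` is hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between truth and prediction."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up compressive strengths in MPa, for illustration only.
y_true = [30.0, 42.0, 25.0, 51.0]
y_pred = [33.0, 38.0, 25.0, 56.0]

print(f"RMSE: {rmse(y_true, y_pred):.2f} MPa")
```

The individual errors here are 3, 4, 0 and 5 MPa, giving an RMSE of about 3.54 MPa. Because the errors are squared before averaging, large misses weigh more heavily than small ones.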
On its own, this value doesn't tell us a great deal; the reported RMSE is most useful for comparing different modelling paradigms. We shall use the XGBoost API to further inspect our model performance through cross validation.
First, let's look through some other useful tools in our model. We can save our trained model with xgbr.save_model('model.json'). Loading the corresponding file is achieved with xgbr.load_model('model.json').
We can determine the R²-value of the predictions against the ground truth using xgbr.score(X_test, y_test), which for this model returns
0.90
Note that, straight out of the box, the XGBoost model equals the performance of our Random Forest predictor, which also scored 0.90. We hope to improve upon this with careful hyper-parameter selection.
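The score quoted above is the coefficient of determination, R². As a minimal sketch, it can be computed by hand from its standard definition, one minus the ratio of the residual sum of squares to the total sum of squares. The numbers below are made up and the helper name `r_squared` is hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up compressive strengths in MPa, for illustration only.
y_true = [30.0, 42.0, 25.0, 51.0]
y_pred = [33.0, 38.0, 25.0, 56.0]

print(f"R^2: {r_squared(y_true, y_pred):.2f}")
```

A value of 1 means the predictions match the targets exactly, while a model that always predicts the mean of the targets scores 0.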
Visualise the stumps
XGBoost has a utility for plotting the tree decisions. That's given by the following function.
xgb.plot_tree(xgbr, num_trees=0)
Figure 1. The decision graph for the first tree in the model.
This figure displays a graph of the first weak learner (setting num_trees to zero). The first learner is the most meaningful to assess, as it attempts to model the solution directly, whereas each subsequent tree learns the error left by the previous ones.