by Dr Andy Corbett

Lesson

Gradient Boosting

14. Cross validate with the XGBoost API

📂 Resources

Download the resources for this lesson here.

Picking up where we left off in the last lesson, we return to our code example to delve into XGBoost slightly further. By this we mean opening up some of the features of the API, including feature importance and cross-validation. This way we can produce an explained and tested algorithm ready for professional use cases.

📑 Learning Objectives
  • Return to the previous trained XGBoost model.
  • Assess the feature importance of the model and plot the results.
  • Perform cross-validation of the model ready for deployment.

Feature importance with XGBoost


XGBoost has an inbuilt explainability function via its feature importance module. We can assess the relevance of each feature to the overall model performance, indicating which of the many features do most to determine the dependent variable. There are three methods, which default to weight:

  1. weight: This is a measure of how many times each feature is used to split a tree, averaged over the ensemble.
  2. gain: This is a measure of the accuracy gained on the prediction by including splits on a given feature.
  3. cover: This considers the number of data points split at each node by a given feature, averaged over the ensemble.
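
To see how the three compare on an already-trained model, we can query the underlying booster directly (a quick sketch, assuming the xgbr regressor trained in the previous lesson):

# Query each importance type from the trained booster.
booster = xgbr.get_booster()
for imp_type in ('weight', 'gain', 'cover'):
    print(imp_type, booster.get_score(importance_type=imp_type))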

Setting xgbr.importance_type to one of these before training and then calling xgbr.feature_importances_, we can plot the results in a bar chart.

import matplotlib.pyplot as plt

# Sort features by importance so the bar chart reads from least to most important.
sort_idx = xgbr.feature_importances_.argsort()
predictor_names = df.columns[:-1]

# Set the figure size before plotting so it applies to this chart.
plt.rcParams['figure.figsize'] = [6, 5]

plt.barh(predictor_names[sort_idx], xgbr.feature_importances_[sort_idx])
plt.xlabel('Feature Importance')
plt.show()

Figure 1. XGBoost feature importance ranking.

Cross validation with the XGBoost API


When reporting your results to a commissioner, they will probably ask for two things:

  1. Reliability
  2. Explainability

For the latter (2), a bonus of XGBoost is that it is built on decision trees, a very intuitive model. You can show them how the model filters inputs through the splits in the tree, and you can additionally show them the ranking of how important each feature is to the overall model (Fig. 1).

To answer (1), we have already considered assessing the RMSE and the R² value of our model on the held-out test data. We did this by holding back 20% of the total data set; we then used this to validate the model, a measure of reliability.

However, if we start changing hyperparameters based on this test set, it is no longer impartial (or unbiased) with respect to the model. One fix would be to withhold another set on which to perform a final validation after hyperparameter tuning. But when data is scarce, this is simply not an option.

Answer: cross validation. This is a technique whereby a small amount of data is withheld from the set and the model is trained on the remainder, just as before. This is repeated, at random, k times, and the RMSE score is reported as an average across the k models. In this way the model is agnostic to the choice of test set. Based on this experiment we can select appropriate hyperparameters for the model.
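
Before switching APIs, it is worth noting that this can be done with the scikit-learn wrapper we have used so far (a minimal sketch, assuming the full predictor matrix X and target y from the previous lesson):

from sklearn.model_selection import cross_val_score

# Five-fold CV with the sklearn wrapper; sklearn reports negative RMSE,
# so we flip the sign when printing.
scores = cross_val_score(
    xgbr, X, y, cv=5, scoring='neg_root_mean_squared_error'
)
print('RMSE: %.3f +/- %.3f' % (-scores.mean(), scores.std()))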

Now let's use XGBoost in a different way, with the native Python API.

Firstly, we store the parameters to our model as a dictionary.

params = {
    'objective': 'reg:squarederror',
    'colsample_bytree': 0.3,
    'learning_rate': 0.1,
    'max_depth': 5,
    'alpha': 10,
}

This is the first time we are seeing the parameter alpha, which sets the L1-regularisation on the weights; read: a higher value produces a more conservative model with sparser parameters. At each leaf, if the sum of the residuals falls inside the interval (-alpha, alpha), its contribution is zeroed out; outside that interval, the sum is shrunk towards zero by alpha. Alternatively, one can set lambda for L2-regularisation, which also penalises large weights but favours smooth solutions over sparse ones.
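
To make that concrete, here is a rough sketch in plain Python of the soft-thresholding idea behind the L1 term (an illustration of the mechanism, not XGBoost's actual implementation):

def l1_shrink(residual_sum, alpha):
    """Soft-threshold a leaf's residual sum, as L1-regularisation does.

    Sums inside (-alpha, alpha) are zeroed out (sparser leaves); sums
    outside are pulled towards zero by alpha (more conservative).
    """
    if residual_sum > alpha:
        return residual_sum - alpha
    if residual_sum < -alpha:
        return residual_sum + alpha
    return 0.0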

Next, we assemble our XGBoost data structures into a list on which the model will be evaluated:

evals = [(dtest, "validation"), (dtrain, "train")]
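
Here dtrain and dtest are the DMatrix objects built in the previous lesson; if you are rejoining at this point, they can be reconstructed from that lesson's train/test split (a sketch, assuming X_train, X_test, y_train and y_test are still in scope):

import xgboost as xgb

# XGBoost's native data structure, wrapping features and labels together.
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)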

To repeat the model fitting of the sklearn API, we would then run:

# Train model
model = xgb.train(
    params=params,
    dtrain=dtrain,
    num_boost_round=10000,
    evals=evals,
    verbose_eval=100,
    early_stopping_rounds=1000,
)

Since we want to operate in the wild, one will typically allow a great many boosting rounds num_boost_round (additional stumps), but include an early-stopping parameter early_stopping_rounds, which halts training if the model fails to reduce the validation loss over this many extra boosts.

We use verbose_eval to print the evaluation metrics only once every 100 boosts during training.
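
When early stopping fires, the returned booster records its best round, which we can then use at prediction time (a sketch; iteration_range requires a reasonably recent version of XGBoost):

# The round with the lowest validation loss, recorded by early stopping.
print('Best round:', model.best_iteration)

# Predict using only the trees up to and including the best round.
preds = model.predict(dtest, iteration_range=(0, model.best_iteration + 1))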

Similarly, to assess this selection of parameters we can perform k-fold cross validation. Here we choose k = 5:

results = xgb.cv(
    params,
    dtrain,
    num_boost_round=10000,
    nfold=5,
    early_stopping_rounds=100,
)

Note that the only additional parameter here is nfold, which specifies the number of cross-validation folds we wish to use.

Conveniently, this exports the results to a pandas data frame, where results.head() returns:

|   | train-rmse-mean | train-rmse-std | test-rmse-mean | test-rmse-std |
|---|-----------------|----------------|----------------|---------------|
| 0 | 35.8464         | 0.399326       | 35.8702        | 1.75661       |
| 1 | 32.8080         | 0.414837       | 32.9017        | 1.69339       |
| 2 | 30.0616         | 0.294222       | 30.2173        | 1.75777       |
| 3 | 27.6039         | 0.300419       | 27.7900        | 1.69836       |
| 4 | 25.3872         | 0.304030       | 25.6277        | 1.65701       |

The reported RMSE values are averaged across all k runs, so they are agnostic of the choice of testing fold. The standard deviation indicates the confidence in the prediction.

The results table has one row per boosting round. From here we can request the minimum RMSE value from test-rmse-mean, and similarly the predictions with the lowest standard deviation.

# Lowest mean test RMSE and tightest spread achieved across the rounds
min_rmse = results['test-rmse-mean'].min()
min_sd = results['test-rmse-std'].min()

We can then select the number of boosting rounds that achieves the best performance in terms of RMSE, which in our case is 571.
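
To recover that round programmatically, we can ask the results frame for the index of the minimum (the exact round will vary with the data and random seed):

# Boosting round with the lowest mean test RMSE across the folds
best_round = results['test-rmse-mean'].idxmin()
print(best_round, min_rmse)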

Conclusion


As we said at the start, XGBoost is one of the most effective algorithms available, and we hope this tutorial gives you a flavour of the model to help you apply it to your own datasets in the wild.