by Dr Andy Corbett
Gradient Boosting
14. Cross validate with the XGBoost API
Download the resources for this lesson here.
Picking up where we left off in the last lesson, we return to our code example to delve into XGBoost slightly further. By this we mean opening up some of the features of the API, including feature importance and cross-validation. This way we can produce an explained and tested algorithm ready for professional use cases.
- Return to the previous trained XGBoost model.
- Assess the feature importance of the model and plot the results.
- Perform cross-validation of the model ready for deployment.
Feature importance with XGBoost
XGBoost has an inbuilt explainability function via its feature importance module. We can assess the relevance of each feature to the overall model performance, indicating which of the many features best determine the dependent variable. There are three methods, which default to weight:
- weight: a measure of how many times each feature is used to split the data across the trees of the ensemble.
- gain: a measure of the accuracy gain on the prediction from splits that use a given feature.
- cover: considers the number of data points covered at each node split by a given feature.
Setting xgbr.importance_type to one of these before training and then calling xgbr.feature_importances_, we can plot the results in a bar chart.
import matplotlib.pyplot as plt

# Sort features by importance and plot them as a horizontal bar chart
plt.rcParams['figure.figsize'] = [6, 5]
sort_idx = xgbr.feature_importances_.argsort()
predictor_names = df.columns[:-1]
plt.barh(predictor_names[sort_idx], xgbr.feature_importances_[sort_idx])
plt.xlabel('Feature Importance')
plt.show()
Figure 1. XGBoost feature importance ranking.
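To compare the other importance measures without retraining, the underlying booster can also be queried directly. This is a sketch, assuming xgbr is the regressor fitted in the previous lesson:

# Query alternative importance measures from the trained booster.
booster = xgbr.get_booster()
print(booster.get_score(importance_type='gain'))   # accuracy gain per feature
print(booster.get_score(importance_type='cover'))  # coverage of splits per feature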
Cross validation with the XGBoost API
When reporting your results to a commissioner, they will probably ask for two things:
- Reliability
- Explainability
For the latter (2), a bonus of XGBoost is that it is built on decision trees, a very intuitive model. You can show them how the model filters inputs via splits in the tree graph, and additionally show them the ranking of how important each feature is to the overall model (Fig. 1).
To answer (1), we have already considered assessing the RMSE of our model on the held-out test data. We did this by holding back 20% of the total data set; we then used this to validate the model, a measure of reliability.
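For reference, that 80/20 split would have been produced along these lines. This is a sketch; the variable names X, y and the random seed are assumptions:

from sklearn.model_selection import train_test_split

# Assumed recreation of the 80/20 hold-out split: the last column of df is the target.
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)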
However, if we start changing hyperparameters based on this test set, it is no longer an impartial (or unbiased) measure of the model. One fix for this would be to withhold another set on which to perform a final validation after hyperparameter tuning. But when data is scarce, this is simply not an option.
Answer: cross-validation. This is a technique whereby a small amount of data is withheld from the set and the model is trained on the remainder, just like before. This is repeated k times, each with a different randomly chosen hold-out set, and the RMSE score is reported as an average across the models. In this way the model is agnostic to the choice of test set. Based on this experiment we can select appropriate hyperparameters for the model.
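To make this concrete, here is a minimal hand-rolled sketch of k-fold cross-validation using scikit-learn's KFold, assuming the X and y arrays from the split above; the xgb.cv routine we meet below automates this loop.

import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

# Hand-rolled k-fold cross-validation: train on k-1 folds, score RMSE on the held-out fold.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
rmse_scores = []
for train_idx, test_idx in kf.split(X):
    fold_model = xgb.XGBRegressor(n_estimators=100, max_depth=5, learning_rate=0.1)
    fold_model.fit(X[train_idx], y[train_idx])
    preds = fold_model.predict(X[test_idx])
    rmse_scores.append(np.sqrt(mean_squared_error(y[test_idx], preds)))

# Average and spread across folds, agnostic to any single choice of test set.
print(f'RMSE: {np.mean(rmse_scores):.3f} +/- {np.std(rmse_scores):.3f}')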
Now let's use XGBoost in a different way, with the native Python API.
Firstly, we store the parameters to our model as a dictionary.
params = {
"objective": "reg:squarederror",
'colsample_bytree': 0.3,
'learning_rate': 0.1,
'max_depth': 5,
'alpha': 10,
}
This is the first time we are seeing the parameter alpha, which sets L1-regularisation for the weights; read: a higher value produces a more conservative model, with sparser parameters. If the sum of the residuals is outside the interval (-alpha, alpha) then the contribution to the loss is adjusted by a proportional amount. Alternatively, one can set lambda for L2-regularisation, which likewise prefers outputs in (-lambda, lambda) but favours smooth solutions.
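As an illustration only (not used in the rest of the lesson), an L2 penalty could be supplied alongside or instead of alpha; params_l2 here is a hypothetical variant:

# Hypothetical variant: swap in an L2 penalty on the leaf weights.
params_l2 = {**params, 'lambda': 1.0}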
Next, we assemble our XGBoost data structures into a list on which the model will evaluate:
evals = [(dtest, "validation"), (dtrain, "train")]
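For reference, dtrain and dtest are the XGBoost DMatrix structures built from the train/test split; a sketch of their construction, assuming the split arrays from earlier, would look like this:

import xgboost as xgb

# Assumed construction of the DMatrix objects from the 80/20 split.
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)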
To repeat the model fitting of the sklearn API we would then run:
# Train model
model = xgb.train(
params=params,
dtrain=dtrain,
num_boost_round=10000,
evals=evals,
verbose_eval=100,
early_stopping_rounds=1000,
)
Since we want to operate in the wild, typically one will consider a large number of boosting rounds num_boost_round (additional stumps), but include an early stopping parameter early_stopping_rounds, which halts training if the model fails to reduce the validation loss over this many extra boosts.
We use verbose_eval to print the evaluation only every 100 boosts during training.
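Once early stopping triggers, the returned booster records the best round on the validation set. This is a hedged sketch of how one might use it, with attribute and argument names as in recent XGBoost releases:

# Inspect where early stopping settled and predict using only those rounds.
print(model.best_iteration, model.best_score)
preds = model.predict(dtest, iteration_range=(0, model.best_iteration + 1))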
Similarly, to assess this selection of parameters we can perform k-fold cross-validation. Here we choose k = 5:
results = xgb.cv(
params,
dtrain,
num_boost_round=10000,
nfold=5,
early_stopping_rounds=100,
)
Note that the only additional parameter here is nfold. This specifies the number of cross-validation folds we wish to use.
Conveniently, this exports the results to a pandas data frame, where results.head() returns:
| | train-rmse-mean | train-rmse-std | test-rmse-mean | test-rmse-std |
|---|---|---|---|---|
| 0 | 35.8464 | 0.399326 | 35.8702 | 1.75661 |
| 1 | 32.808 | 0.414837 | 32.9017 | 1.69339 |
| 2 | 30.0616 | 0.294222 | 30.2173 | 1.75777 |
| 3 | 27.6039 | 0.300419 | 27.79 | 1.69836 |
| 4 | 25.3872 | 0.30403 | 25.6277 | 1.65701 |
The reported RMSE values are given as an average across all runs, so are agnostic of the choice of k-fold testing set. The standard deviation indicates the confidence in the prediction.
The results table has a number of rows equal to the number of boosting rounds. From here we can request the minimum RMSE value from the test-rmse-mean column, and similarly the prediction with the lowest standard deviation.
min_rmse = results['test-rmse-mean'].min()
min_sd = results['test-rmse-std'].min()
We can then select the number of boosting rounds that achieves the best performance in terms of RMSE, which in our case is 571.
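That round can be recovered programmatically from the table; a small sketch on top of the results data frame:

# Row label of the lowest mean test RMSE; each row corresponds to one boosting round.
best_round = results['test-rmse-mean'].idxmin()
print(best_round, min_rmse)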
Conclusion
As we said at the start, XGBoost is one of the most effective algorithms available, and we hope this tutorial gives you a flavour of the model to help you apply it to your own datasets in the wild.