by Prof Tim Dodwell
Machine Learning Workflow
Exploring if Your Model Is Any Good - Training Curves
In this explainer we look at how we validate our machine learning models once they have been trained. This is an essential part of any machine learning workflow. We will cover:
- The basic approach to estimating a training and testing loss.
- The extension of this to k-fold cross-validation.
- Understanding what training / learning curves are, and how to spot good and bad models from them.
- Understanding what data leakage is, and how you might observe it at the validation stage.
Validation is a key fundamental step in the machine learning workflow. The concepts of how you might do it are quite simple; however, it requires some honest detective work to be really confident in your model's ability to generalise. Remember, this is our ultimate goal in machine learning.
Let us first look at the most basic step, which is to estimate a training and testing loss.
Estimating Training & Testing Loss - Test / Train Split
There are a few exceptions in machine learning (aren't there always), but here we are primarily talking about supervised learning algorithms. This is when you have a dataset of example pairs of inputs and outputs together, i.e. $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$. As you will see in many of our supervised learning examples, prior to training a model we will perform a split of our data. A percentage of the total data is retained for testing (a typical default might be 20%).
This ring-fenced testing data will not be seen by the algorithm during training. How well the machine learning model does on this testing set gives an indication of how well the model generalises.
To evaluate "how good?" we need to evaluate the average loss function over the training and testing data, which we might call $\mathcal{L}_{\text{train}}$ and $\mathcal{L}_{\text{test}}$.
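As a concrete sketch (assuming a regression problem with a squared-error loss, a choice not fixed by the text above), these averages take the form

$$
\mathcal{L}_{\text{train}} = \frac{1}{N_{\text{train}}} \sum_{i \in \mathcal{D}_{\text{train}}} \big(y_i - \hat{f}(\mathbf{x}_i)\big)^2,
\qquad
\mathcal{L}_{\text{test}} = \frac{1}{N_{\text{test}}} \sum_{i \in \mathcal{D}_{\text{test}}} \big(y_i - \hat{f}(\mathbf{x}_i)\big)^2,
$$

where $\hat{f}$ is the trained model and $\mathcal{D}_{\text{train}}$, $\mathcal{D}_{\text{test}}$ are the training and testing splits of size $N_{\text{train}}$ and $N_{\text{test}}$.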
So how do we do this in scikit-learn?
In most ML libraries you will find test / train split functionality. In scikit-learn, you can perform a test / train split using the train_test_split function from the sklearn.model_selection module. This function randomly splits a dataset into training and testing sets, which can be used for machine learning model training and evaluation, respectively.
Here is a bit of sample code showing how to use train_test_split to split a dataset into 80% training and 20% testing data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Here, X and y are the input features and target variable, respectively. test_size=0.2 specifies that we want to allocate 20% of the data for testing, and random_state=42 ensures that the split is reproducible.
The resulting X_train, X_test, y_train, and y_test arrays can then be used for training and evaluating machine learning models.
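As a minimal sketch of the next step (assuming a simple linear regression model and a mean-squared-error loss, neither of which is fixed by the example above), the training and testing losses can then be estimated like this:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit an (illustrative) model on the training split only
model = LinearRegression()
model.fit(X_train, y_train)

# Average loss over the training data and the ring-fenced testing data
train_loss = mean_squared_error(y_train, model.predict(X_train))
test_loss = mean_squared_error(y_test, model.predict(X_test))

print(f"Training loss: {train_loss:.3f}, Testing loss: {test_loss:.3f}")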
K-Fold Cross-Validation
K-fold cross-validation is a technique used in machine learning to evaluate the performance of a model on a limited amount of data. The main reason why you would use k-fold cross-validation is to get an estimate of the model's performance that is more reliable than simply splitting the data into a training and a test set.
This is particularly the case when you have a limited amount of data and might not be able to afford to set aside a portion of it for a test set. In this situation, k-fold cross-validation allows you to make the most of the data you have by using all of it to evaluate the model.
So what do you do? You split your data into $k$ folds, which means that $(k-1)/k$ of the data is partitioned for training in each case. You will then train $k$ models, cycling through so that each partition is used once as the testing data, with the rest used as the training data.
This is best shown by a picture. Here we show a $k$-fold cross-validation, where a different portion of the data is used for testing in each fold. We can then calculate the loss (error) in each fold and average the results.
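Written out (a small sketch using the notation introduced above), the cross-validation estimate of the loss is simply the mean of the per-fold testing losses:

$$
\mathcal{L}_{\text{CV}} = \frac{1}{k} \sum_{j=1}^{k} \mathcal{L}_{\text{test}}^{(j)},
$$

where $\mathcal{L}_{\text{test}}^{(j)}$ is the testing loss evaluated on the held-out fold $j$.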
Importantly, we get a better estimate of the performance of a model. By repeating the training and testing process multiple times, k-fold cross-validation provides a more stable estimate of the model's performance than a single split into training and test sets. This is particularly important when the performance metric you're interested in is sensitive to the specific samples in the test set, which is often the case when we have limited data.
So, again, how do we do this in scikit-learn?
In scikit-learn, you can perform k-fold cross-validation using the KFold class from the sklearn.model_selection module. This splits the dataset into k equal-sized folds; the model is then trained and tested k times, using a different fold for testing each time and the remaining folds for training.
Here's an example of how to use KFold to perform 4-fold cross-validation:
from sklearn.model_selection import KFold
kf = KFold(n_splits=4, shuffle=True, random_state=42)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # train + evaluate training + test loss
Here, X and y are the input features and target variable, respectively. n_splits=4 specifies that we want to perform 4-fold cross-validation, and shuffle=True specifies that the data should be shuffled before splitting.
The KFold object kf returns an iterator that generates the indices of the training and testing data for each fold. The train_index and test_index arrays can be used to extract the training and testing data for each fold, and then to train and evaluate the model on these data.
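To make the loop above concrete, here is a minimal sketch that fills in the commented step, assuming a simple LinearRegression model and a mean-squared-error loss (these specific choices are illustrative, not part of the original example):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

fold_losses = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train on this fold's training partition
    model = LinearRegression().fit(X_train, y_train)
    # Evaluate the testing loss on the held-out fold
    fold_losses.append(mean_squared_error(y_test, model.predict(X_test)))

# Cross-validation estimate: average of the per-fold testing losses
print(f"Mean CV loss: {np.mean(fold_losses):.3f}")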
We will see more examples of this when we start demonstrating various supervised learning algorithms across the course.
Validation During Training - Loss Curves
Many machine learning models are trained iteratively. We have looked at this when talking about optimisation methods. A step of training is called an epoch. So when we are training a model, we will monitor both the training and testing loss as a function of the number of epochs.
This leads to a typical training curve, as below. Here we show what a training curve for a "good model" might look like.
As we train the model, the blue curve shows the training loss reducing. This is to be expected, since we are updating the parameters of the model to achieve exactly this. Alongside this, for a model which is also learning to generalise, we expect a reduction in the testing loss. This should naturally lag behind the training loss. If both reduce, as shown above, then this indicates that training has been successful.
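As a rough sketch of how such a curve can be produced (re-using the X_train, X_test, y_train, y_test arrays from the earlier split, and assuming an iteratively-trained model such as scikit-learn's SGDRegressor with a mean-squared-error loss; the specific model and settings are illustrative only):

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=42)

train_curve, test_curve = [], []
for epoch in range(100):
    # One pass (epoch) of iterative training over the training data
    model.partial_fit(X_train, y_train)
    # Record both losses at every epoch to build the training curves
    train_curve.append(mean_squared_error(y_train, model.predict(X_train)))
    test_curve.append(mean_squared_error(y_test, model.predict(X_test)))

# train_curve and test_curve can now be plotted against epoch number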
If the training loss does not decrease with epochs, the first step is to try reducing the learning rate in your optimisation step, or to try different starting guesses.
When we don't see this ideal behaviour, the shape of the curves gives us strong indications of what is happening to our model.
First, let's look at the case where both the training and testing loss do not reduce much, leaving a large residual error which cannot be removed by additional training, like the following training loss curve.
A training loss curve like this indicates that the model is "underfitting". This will be covered in other areas of the course, but underfitting is when a model doesn't have sufficient flexibility to express the variations in the data. A simple example would be using a straight line to model the tidal heights over a day: the best it could do is model the average. Since we can't fit even the training data well, we see this large residual in the error, which doesn't improve as we increase training. The solution here is to go back to your model and add more features / flexibility.
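To make the tidal-heights example concrete, here is a hypothetical illustration, with synthetic sinusoidal data standing in for tidal heights:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic "tidal height" data: roughly two tides over a 24-hour day
hours = np.linspace(0, 24, 100).reshape(-1, 1)
height = 2.0 * np.sin(2 * np.pi * hours.ravel() / 12.4)

# A straight line has too little flexibility for this signal...
line = LinearRegression().fit(hours, height)

# ...so even the *training* loss stays large: the line can only capture the average
print(f"Training loss of straight-line model: {mean_squared_error(height, line.predict(hours)):.3f}")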
Secondly, there is the opposite of this effect: overfitting. Here the model has too much flexibility. This is why it can reproduce the training data well, but it is not constrained sufficiently to generalise well. On the training curve, this typically shows up as the training loss continuing to fall while the testing loss stalls or begins to rise again. We show a typical overfitting model next to our training curve to give the idea.
Thirdly, there is an odd one. Here the model looks like it has trained really well: both the training and testing losses have come right down as we increase the number of training epochs. The problem is that it is too good, since the testing loss is matching the performance on the training data, and in some cases doing better.
This typically occurs when we have data leakage. This means the testing data and the training data aren't independent. There is therefore information in the training data which explicitly helps with our testing set, beyond the ability of the algorithm to generalise.
- A good example of this is time series data. If we do a random test / train split of time series data, we will have data in both the testing and training sets which is strongly correlated. To build a test / train split for time series data we must leave a sufficient gap so that there is little or no correlation between the training and testing sets, as we show here (and in the sketch after this list).
- Another example is called "group leakage". A classic case was when Andrew Ng's research group had 100k x-rays of 30k patients, meaning approximately 3 images per patient. The algorithm used random splitting instead of ensuring that all images of a patient were in the same split. Hence the model partially memorised the patients instead of learning to recognise pneumonia in chest x-rays.
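As a sketch of how scikit-learn can help guard against both kinds of leakage (the split settings and the patient_ids array below are illustrative assumptions, not part of the examples above):

from sklearn.model_selection import TimeSeriesSplit, GroupKFold

# Time series: ordered splits, with a gap of samples left out between
# the training and testing sets to break the correlation between them
tscv = TimeSeriesSplit(n_splits=5, gap=10)
for train_index, test_index in tscv.split(X):
    X_train, X_test = X[train_index], X[test_index]

# Group leakage: keep all samples from the same group (e.g. the same
# patient, via a hypothetical patient_ids array) in the same split
gkf = GroupKFold(n_splits=5)
for train_index, test_index in gkf.split(X, y, groups=patient_ids):
    X_train, X_test = X[train_index], X[test_index]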
Concluding Remarks
Validation is a key fundamental step in the machine learning workflow. The concepts of how you might do it are quite simple; however, it requires some honest detective work to be really confident that your model (1) has trained and (2) can generalise to unseen data.
We have presented a few ideas on how you can do this. Other techniques can be picked up in the worked examples through the digiLab courses. This is because evaluating a good model takes experience.
Remember, generalisation is our ultimate goal when building a machine learning algorithm.