Prof Tim Dodwell

by Prof Tim Dodwell

Lesson

Machine Learning Workflow

11. Training, Optimisation & Hyper-parameter Tuning

In this explainer we go through the key concept of training. We look at this in two stages, training of the parameters of a model and also in selecting / tuning of hyper-parameters.

You should leave with a clear idea of that happens under-the-hood when you call .fit(X,y) in sklearn, and understand some of the choices you can make to fit a better machine learning model, or make the training process easier through good selection of hyper parameters.

Training a Machine Learning model really centres around using methods in optimisation. Training ML models often involves minimisation of a loss or objective function, or maximising a quantity (could be a reward or a variance).

We define two sets of parameters:

  • Parameters are the values that a model learns during training in order to make predictions on new data. These values are determined by the algorithm and are specific to the model architecture and dataset being used. Examples of parameters include weights in a neural network or coefficients in a linear regression model. Parameters are updated during training to minimize the difference between the model's predictions and the actual output. A model with parameters is a "parameteric model"!

  • Hyperparameters, on the other hand, are settings that are not learnt by the model during training. Instead, they are set by the user prior to training and are used to control the learning process. Examples of hyperparameters include the learning rate, the number of hidden layers in a neural network, and the regularization strength (all of which you will come across at some point). Hyper-parameters can have a significant impact on the model's performance, and finding the optimal values often requires experimentation and tuning. Optimisation are used (often different ones) to perform tuning of hyperparameters.

Optimisation of Parameterised Models


So when we think of optimisation of parameter models we can think of the predicts being described by some parameterised function y~=fw(x),\tilde y = f_{{\bf w}}({\bf x}), here the dependence on parameters is noted by w{\bf w}. So for example, these could be the coefficients in a linear regression model, or they could be weights within a neural network.

Once we set these parameters the mapping from inputs to outputs is well defined.

Here let's think about training of a supervised learning algorithm. So we have a training set which are example pairs of inputs and outputs

X:={(x1,y1),(x2,y2),,(xN,yN)}X := \{ ({\bf x}_1, y_1), ({\bf x}_2, y_2), \ldots, ({\bf x}_N, y_N) \}

As discussed (explainer on loss functions) we can describe how well a model fits this training set by calculating a loss function, so for example a simple loss for a regression problem might be

L(w)=1Ni=1N(yifw(xi))2\mathcal L({\bf w}) = \frac{1}{N}\sum_{i=1}^N \left(y_i - f_{{\bf w}}({\bf x}_i)\right)^2

the average mean square error over the training set. Training the model here, means finding the weights w{\bf w} which minimise the loss function, i.e. minimising the error.

Let's draw a picture

academy.digilab.co.uk

This sketch shows a plot of a loss function with respect to weights. Here we have a simple single parameter problem. The idea is we want to find the value of ww which gives the local minma. We use optimisation strategies to do this. We can see for complex loss functions (i.e. complex models and data) we can easily get "trapped" in local minima, so this problem isn't easy.

For this task we can use a whole suite of optimisation techniques. But the most commonly used are gradient based methods, which we will look at.

Gradient Descent

So I like to think of minimisation problems as an landscape, where I am trying to walk (or run) to the lowest point in a mountain range. Gradient descent is trying to do this too. It basically iteratively updates the weights w{\bf w} to move down the hill (the loss function) by a little bit, and then repeats

So to be a bit more mathematical, down hill (or gradient descent) means against the slope. The slope or gradient of a function is denoted by L\nabla \mathcal L (aka grad L\mathcal L), and is calculated by differentiating the loss function with respect to the weights w{\bf w}. So we see that gradient descent is written like

w(k+1)=w(k)ηwL{\bf w}^{(k+1)} = {\bf w}^{(k)} - \eta \nabla_{{\bf w}}\mathcal L

Here the parameter η\eta is called the step size or learning rate, and therefore determines how big a step I take at each iterations. Here is a sketch of what is going on!

academy.digilab.co.uk

In this example, at w(i)w^(i) the gradient of the loss is negative, therefore the ηL-\eta\nabla\mathcal L is positive, stepping towards the minima.

The learning rate is hyper-parameter of my method, and needs to chosen so the optimisation strategy converges.

Too small a learning rate, and the solution converges very slowly towards a local minima. A too large learning rate and gradient descent can become unstable, diverging away from a solution.

We can illustrate this with a simple little sketch

academy.digilab.co.uk

Finding a good learning rate can be challenging. There are more sophisiticated methods. We can talk more about this in hyper-parameter tuning.

Stochastic Gradient

We see that the central cost in gradient descent is calculating the derivative of the loss

wL=w[1Ni=1N(yif(xi))2]=2Ni=1N(yif(xi))wf(xi)\nabla_{\bf w} \mathcal L = \nabla_{\bf w}\left[ \frac{1}{N}\sum_{i=1}^N \left(y_i - f({\bf x}_i)\right)^2\right] = \frac{2}{N}\sum_{i=1}^N \left(y_i - f({\bf x}_i)\right)\nabla_{{\bf w}} f({\bf x}_i)

We can reduce the cost of this by only calculating an approximation of the gradient, by randomly sampling a subset of the training data. Reducing the computation to a loop over MNM \ll N training points

We then call this approximate Gradient ~wL\tilde{\nabla}_{{\bf w}}\mathcal L, but the update to the weights is the same as before

w(k+1)=w(k)+η~wL{\bf w}^{(k+1)} = {\bf w}^{(k)} + \eta \tilde{\nabla}_{{\bf w}}\mathcal L

This process recalled Stochastic Gradient, there are various versions of this, but this is the core approach, and used widely in machine learning. Particularly when datasets are very large. This is very typical in Neural Network applications, since due to the huge number of parameters there is also huge data requirements.

Optimisation of Hyper-Parameters or should we saying "Tuning".


Hyper-parameters are parameters which control the model defition or the training process. They are not learnt directly from data, and are set before the "training" process and greatly effect the training process and overall model performance.

Examples of this will include

  • Learning Rate
  • Dropout Rate (particularly to deep learning)
  • Regularisation Parameters
  • Parameters defining model structure, e.g. number of hidden layers, clusters, trees and layers.
  • High level model parameters, for example kernel parameters for a Gaussian Process.

Let's look at took three different options

Grid Search - very simple way of finding "optimal" hyperparameters. Train models for each of the parameters over a grid of values, down select the best. The challenge he is it requires the building of many machine learning models, if expensive to train each then this can be really challenging computational. For simple models it is a really nice simple approach which can be implemented very quickly.

academy.digilab.co.uk

Hyperband - Hyperband is a very simple but effective idea. It initial start training many models with different hyper-parameters. After one a period, the models are rank in terms of performance, the worse are 50% and killed, and computational resource is doubled down on the best hyper-parameter values. This is also known as successive halfing.

academy.digilab.co.uk

Bayesian Optimisation - Is a more complex approach by which a surrogate model for the model performance over the space of hyper-parameters is built. This allows for the minimum can be estimated, and additional hyper-parameter can be sampled. The approach is much more targeted than Grid Search, and providing the number of hyperparameters is not too large then can be extremely effective. Implementation of these approach, which using Gaussian Processes as a surrogate, are more sophisticated to implement and require use of packages like twinLab for example.

academy.digilab.co.uk