
by Prof Tim Dodwell

Lesson

General Linear Models

4. Underfitting and Overfitting

In this explainer we will talk about the key machine learning principles of underfitting and overfitting in the context of linear models. However, they apply to many types of machine learning models; the best-known examples are neural networks.

A summary of the key learning outcomes:

  • Understand the importance of overfitting and underfitting and their connection to generalisation.
  • Demonstrate what overfitting and underfitting look like with a simple polynomial model.
  • Describe the general strategies we have for overcoming overfitting.

For me, machine learning is all about "generalisation". To recap, generalisation is the ability of a model to give good predictions (or answers) on cases outside its training set. In general, it is what we recognise as "intelligence" in an algorithm.

From this viewpoint, generalisation should be a (or in some cases the) primary metric we look at when evaluating a model. So when do models not generalise well?

So let's think of a really simple example. Suppose we have an underlying process (a damped oscillator, for example) which generates data as follows

$$y_j = \exp\left(-\frac{1}{8}x_j^2\right)\sin\left(4\pi x_j\right) + \varepsilon$$

where $\varepsilon \sim \mathcal{N}(0, 0.1)$ to simulate noisy data. We generate $N + M$ data points evenly spaced on $x \in [0, 1]$. We are going to do a "train/test split", so we use $N$ points for training (green dots), whilst keeping $M$ back for testing (red dots).
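If you want to reproduce this yourself, here is a minimal sketch of how such a dataset and split might be generated with NumPy. The values of $N$ and $M$, the random seed, and the use of a random split are assumptions for illustration, not taken from the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen for reproducibility (assumption)

# Illustrative sizes: N training points, M held-back test points (values assumed)
N, M = 20, 10

# N + M evenly spaced inputs on [0, 1]
x = np.linspace(0.0, 1.0, N + M)

# Damped oscillator plus Gaussian noise; 0.1 is used as the noise scale here
y = np.exp(-x**2 / 8.0) * np.sin(4.0 * np.pi * x) + rng.normal(0.0, 0.1, size=x.shape)

# Train/test split: keep M points back for testing
idx = rng.permutation(N + M)
x_train, y_train = x[idx[:N]], y[idx[:N]]
x_test, y_test = x[idx[N:]], y[idx[N:]]
```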

Ok, so now let us try and fit a polynomial model to this of the form

$$f_K(x) = w_0 + w_1 x + w_2 x^2 + \ldots + w_K x^K = \sum_{k=0}^{K} w_k x^k.$$

Let us make $K$ (the order of the polynomial) large, so $K = 25$.
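As a rough sketch of how this fit could be done, continuing from the data snippet above: the design (Vandermonde) matrix has columns $1, x, x^2, \ldots, x^K$, and the weights are found by least squares. The helper name `poly_predict` is just for illustration.

```python
# Design matrix with columns 1, x, x^2, ..., x^K evaluated at the training inputs
K = 25
Phi_train = np.vander(x_train, K + 1, increasing=True)

# Least-squares estimate of the weights w_0, ..., w_K
w, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)

def poly_predict(x_new, w):
    """Evaluate the fitted polynomial at new inputs."""
    return np.vander(x_new, len(w), increasing=True) @ w

train_mse = np.mean((poly_predict(x_train, w) - y_train) ** 2)
test_mse = np.mean((poly_predict(x_test, w) - y_test) ** 2)
print(f"K = {K}: training MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

With $K = 25$ you should typically see a much smaller training error than test error, which is exactly the overfitting signature discussed below; swapping in $K = 3$ reproduces the underfitting case further down.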

[Figure: order K = 25 polynomial fit to the noisy training data]

So what are we looking at here? The true function is shown as a solid black line, whilst the polynomial fit is shown in blue. We see that the fit at the training points (shown in green) is very good; however, the accuracy at unseen points (red points) is poor.

In this case we see that the model does not generalise well to unseen data. It isn't a good model. This is a simple example of overfitting. So overfitting is characterised by a low training error and a high validation error. Training curves take this sort of shape:

[Figure: training and validation error curves characteristic of overfitting]

The opposite of overfitting is underfitting. This is the case where the representative power of the model is not sufficient to describe the features in the data. A much lower order polynomial will demonstrate this, so for example let's look at a third-order polynomial.

[Figure: third-order (cubic) polynomial fit underfitting the data]

Here we see that a cubic (third-order polynomial) is insufficiently expressive to describe the oscillations in the data. The polynomial captures the general decaying trend of the data, but not the finer oscillations. This model is described as underfitting the data. Underfitting is characterised by both a high residual training error (i.e. it cannot be reduced beyond some point) and a high validation error.

Ok, so how do we build models which do not overfit or underfit, i.e. how do we build models which are "just right"?

In a polynomial model, there is a nice way to control this over- and underfitting: we optimise over the order of the polynomial. So let's look at a plot of training and validation error against the order of the polynomial.

So let us build models from order $K = 5$ all the way up to $K = 40$, and plot the validation error against the polynomial order.
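One way such a sweep could be scripted, reusing the data and the `poly_predict` helper from the earlier sketches (here the held-back test points are used as the validation set):

```python
import matplotlib.pyplot as plt

orders = list(range(5, 41))
train_errors, val_errors = [], []

for K in orders:
    # Fit a polynomial of order K by least squares
    Phi = np.vander(x_train, K + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    # Record mean squared error on training and validation points
    train_errors.append(np.mean((poly_predict(x_train, w) - y_train) ** 2))
    val_errors.append(np.mean((poly_predict(x_test, w) - y_test) ** 2))

plt.semilogy(orders, train_errors, label="training error")
plt.semilogy(orders, val_errors, label="validation error")
plt.xlabel("polynomial order K")
plt.ylabel("mean squared error")
plt.legend()
plt.show()
```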

[Figure: training and validation error against polynomial order]

We see that models of order $K = 10$ to $K = 15$ minimise the validation error; anything in this range gives a "good" model! Here we do not rely on the training error: once the model order exceeds the number of training points, the model fits the training data perfectly.

Ok, so we can play with the order of a polynomial, but what, in general, are the ways in which we can control overfitting?

  1. Control the number of parameters in the model. This is the example that we have just looked at. We can control the order of the polynomial for a simple polynomial model, but in general in machine learning we can control the number of parameters we have. For example, in a neural network this would be the number of nodes or layers; in a random forest, the number of leaves. The more parameters we have, the more likely we are to overfit. In the next section we will see general strategies for constraining the number of parameters in a model, rather than simply playing with the maximum order. There is a more sophisticated approach to controlling the number of parameters, which strategically drops parameters that offer little benefit to the fit; this process is called regularisation, which we talk about in a following explainer.

  2. Increase the amount of data. Overfitting depends not only on the number of parameters we have to play with, but also on the amount of data we have in relation to the number of parameters. It is therefore more likely in cases where we have only limited data and highly parameterised models. What counts as limited data also depends on the number of inputs, so what is enough data is often a complex question. But in general, collecting or using more data will help with overfitting.

  3. Dropout. We aren't going to cover this in depth here, but a technique widely used in the training of larger machine learning models like neural networks, which always have a tendency to overfit to some degree, is called dropout. Dropout is a strategy used at training time, where a percentage of parameters are randomly set to zero at each training step. What does this do? It removes the dependence on any particular term, and the resulting model is one that is both expressive and robust in its representation. This is often done alongside batch training or optimisation, where the loss or fit is only assessed over a subset of the data; a toy sketch of the masking idea follows this list. More on this in our deep learning courses.
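To make the dropout idea concrete, here is a toy sketch of the masking step only, not the full training procedure (real implementations apply the mask only during training, inside the optimisation loop of a deep learning framework, and the function name and drop probability here are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_mask(values, p_drop, rng):
    """Randomly zero roughly a fraction p_drop of the entries and rescale the
    survivors so the expected value is unchanged (so-called inverted dropout)."""
    keep = rng.random(values.shape) >= p_drop
    return values * keep / (1.0 - p_drop)

# Example: drop roughly 30% of a vector of parameters/activations
activations = rng.normal(size=10)
masked = dropout_mask(activations, p_drop=0.3, rng=rng)
print(masked)
```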

Conclusion


Overfitting and underfitting should be a central consideration when building any machine learning model. It is particularly important in cases where you have high-dimensional (many) inputs and/or smaller datasets. A good model balances fit and complexity, with the ultimate aim of ensuring good generalisation.

Whilst there are various strategies to control overfitting, the most widely used approaches are collecting more data, regularisation and dropout. We will look at regularisation in the next walkthrough.