by Prof Tim Dodwell

General Linear Models

2. The Foundations of ML - Curve Fitting

In this explainer we will give a general introduction to linear models.

  • We will talk through what a linear model is.
  • How it works for simple linear models, like fitting polynomials, which we have probably seen before.
  • Then we look at how we can build linear models which are more expressive.
  • We will show how linear models are trained, both for small numbers of features and data points and for much larger problems.
  • Finally we will touch on the interpretation of bias, which comes naturally out of linear models.

Starter Question. Is the following a linear model?

f(x) = x + 2\sin(2x) + \frac{1}{2}\exp\left(-\frac{x^2}{\lambda}\right)

You should be able to confidently answer this question, and give a reason for your answer by the end of this explainer!

So in this explainer, we are very much looking at supervised learning methods. For this we will start by introducing Generalised Linear Models (or GLMs) in the context of regression.

A regression task is to predict one or more continuous target variables y, given a D-dimensional vector {\bf x} of input variables.

So why are GLMs super important?

  1. They provide the foundations for introducing the core concepts in machine learning.
  2. They are widely used and (can be) highly effective.
  3. Where they can be built, they are often more explainable than other models, in part because they build on our understanding of "correlation" as a central concept.
  4. They provide a stepping stone to understanding more complex machine learning models, in particular Neural Networks and Gaussian Processes.

Fitting Polynomials


So we are probably used to fitting straight lines through data. I remember doing this at school

f(x) = w_0 + w_1 x

where w_0 has the interpretation of the intercept and w_1 is the gradient (or slope) of the curve.

We can extend this to higher-dimensional inputs. What we mean by this is that there are, say, two input variables {\bf x} = [x_1, x_2]^T, and so

f({\bf x}) = w_0 + w_1 x_1 + w_2 x_2,

is a linear model (of linear functions) for a two-dimensional input. Equally we can have models which are higher order polynomials, where we fit quadratic, cubic or higher order curves through the data, so in general we have

f(x) = w_0 + w_1 x + w_2 x^2 + \ldots + w_K x^K = w_0 + \sum_{k=1}^K w_k x^k.

So the key point here is that when we talk about linear models, it doesn't mean the functions themselves are linear (although they can be). We can also go to higher dimensions (more than one input) and higher order,

f({\bf x}) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1 x_2 + w_5 x_2^2

for example, but they can be arbitrarily complex.
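As a quick illustration (this is my own minimal sketch, not code from the lesson), here is how a cubic polynomial could be fitted to some made-up noisy data using numpy. The data-generating function and all values are arbitrary choices for illustration.

```python
import numpy as np

# Generate some example noisy data from a cubic (arbitrary choice for illustration)
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 0.5 - 1.2 * x + 2.0 * x**3 + 0.1 * rng.standard_normal(x.size)

K = 3  # polynomial order
# Design matrix with columns [1, x, x^2, x^3]
Phi = np.vander(x, K + 1, increasing=True)

# Least-squares fit for the weights w_0, ..., w_K
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)  # should be close to [0.5, -1.2, 0.0, 2.0]
```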

A More "General" Linear Model


A linear model is defined as a linear combination of functions. These functions, known as basis functions (or, in machine learning, features), can be any general nonlinear functions.

In general we can write a linear model as follows

f({\bf x}) = w_0\phi_0({\bf x}) + w_1\phi_1({\bf x}) + \ldots + w_K\phi_K({\bf x}) = \sum_{i=0}^K w_i\phi_i({\bf x}),

where each of the w_i's (or weights) is a scalar number, or to be fancy we write w_i \in \mathbb{R}; that is, w_i belongs to the set of real numbers. In machine learning you might also see (and will see me write most of the time) such a linear model written in vector form as follows

f({\bf x}) = {\bf w}^T{\boldsymbol \phi} \quad \text{where} \quad {\boldsymbol \phi} = [\phi_0({\bf x}), \phi_1({\bf x}), \ldots, \phi_K({\bf x})]^T.

Maybe to labour the point a little, this is just a more general case, so if we think of our original model f(x) = w_0 + w_1 x, then this is simply

\phi_0(x) \equiv 1 \quad \text{and} \quad \phi_1(x) = x,

and hence

f(x) = {\bf w}^T{\boldsymbol \phi} = \begin{bmatrix} w_0 & w_1 \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix}.
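In code, this vector form is just a dot product between a weight vector and a feature vector. A minimal sketch (the function names and weight values are my own illustrative choices):

```python
import numpy as np

def phi(x):
    # Feature vector for the simple straight-line model: [1, x]
    return np.array([1.0, x])

w = np.array([2.0, 0.5])  # example weights: w_0 (intercept) and w_1 (slope)

def f(x):
    # f(x) = w^T phi(x)
    return w @ phi(x)

print(f(4.0))  # 2.0 + 0.5 * 4.0 = 4.0
```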

It is often the convention to reserve the first feature function, i.e. \phi_0({\bf x}), to be the constant function. This means that \phi_0({\bf x}) \equiv 1, and therefore

f({\bf x}) = w_0 + \sum_{i=1}^K w_i\phi_i({\bf x})

The parameter w_0 therefore allows us to model any fixed offset in the data, and as a result is often called the bias parameter (different from bias in a statistical sense).

The basis or feature functions themselves can be pre-defined. So, for example, for a single input variable the basis could be powers of x, i.e. \phi_k(x) = x^k. Another option is the "Mexican hat" function, where the basis functions are defined on overlapping regions (think finite element shape functions).

Other basis functions include

\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2\lambda^2}\right)

which are determined by a centre point \mu_j and a length scale \lambda. These are often called Gaussian basis functions. Another common option is the sigmoidal basis function, which is defined by

\phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right) \quad \text{where} \quad \sigma(a) = \frac{1}{1 + \exp(-a)}.

To see what these look like on a plot, let's have a look - polynomials on the left, Gaussians in the middle and sigmoid basis functions on the right.

Figure: examples of basis functions - polynomial (left), Gaussian (middle) and sigmoid (right).
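If you want to reproduce something like this plot yourself, here is a minimal sketch of the Gaussian and sigmoidal basis functions. The centres and length scales are arbitrary choices, not values from the lesson.

```python
import numpy as np
import matplotlib.pyplot as plt

def gaussian_basis(x, mu, lam):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 * lambda^2))
    return np.exp(-(x - mu) ** 2 / (2.0 * lam**2))

def sigmoid_basis(x, mu, s):
    # phi_j(x) = sigma((x - mu_j) / s) with sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))

x = np.linspace(-1.0, 1.0, 200)
centres = np.linspace(-1.0, 1.0, 9)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
for mu in centres:
    ax1.plot(x, gaussian_basis(x, mu, lam=0.2))
    ax2.plot(x, sigmoid_basis(x, mu, s=0.1))
ax1.set_title("Gaussian basis")
ax2.set_title("Sigmoidal basis")
plt.show()
```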

We will also see later in the course that the basis functions don't have to be well-defined formulae as we have above. They can actually be discovered from the data. Whilst this is outside the scope of this explainer, and will be covered elsewhere, unsupervised machine learning models can be used to do this. For those interested in looking ahead, consider reading about Principal Component Analysis (aka PCA) and Autoencoders. More on that later.

Training / Fitting a Linear Model


The important concept now is that, given a set of basis functions \phi_i({\bf x}), we wish to find the weights {\bf w} which minimise the error on the training data - so we need to introduce the idea of fit!

A very natural way to define the error is as follows: for N pairs of training data ({\bf x}_j, y_j), we can define the mean-square loss function for a given set of weights as

\mathcal L = \frac{1}{2N}\sum_{j=1}^N \left( f_{\bf w}({\bf x}_j) - y_j \right)^2 = \frac{1}{2N}\sum_{j=1}^N \left( {\bf w}^T\boldsymbol \phi({\bf x}_j) - y_j \right)^2

Our objective is therefore to minimise this loss function.

A natural way to do this is to find the gradient of the error with respect to the weights w_i, and use this information to learn how to adjust (for the nonlinear case) or pick the weights to minimise the error. The gradient basically tells us how much the error will increase/decrease if we increase/decrease the value of w_i. Optimisation strategies then seek to change the value of the weights to reduce the error.

\frac{\partial \mathcal L}{\partial w_i} = \frac{1}{N}\sum_{j=1}^N\left({\bf w}^T{\boldsymbol \phi}({\bf x}_j) - y_j\right)\phi_i({\bf x}_j)

When we have reached a minimum error, then the gradient with respect to each of the weights will be zero.
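To make the loss and its gradient concrete, here is a rough sketch in vectorised numpy, together with a very simple fixed-step gradient descent loop. The function names, learning rate and step count are my own illustrative choices.

```python
import numpy as np

def loss(w, Phi, y):
    # L = 1/(2N) * sum_j (w^T phi(x_j) - y_j)^2, with Phi the N x (K+1) design matrix
    r = Phi @ w - y
    return 0.5 * np.mean(r**2)

def grad(w, Phi, y):
    # dL/dw_i = 1/N * sum_j (w^T phi(x_j) - y_j) * phi_i(x_j)
    r = Phi @ w - y
    return Phi.T @ r / len(y)

def gradient_descent(Phi, y, lr=0.1, steps=1000):
    # Simple fixed-step gradient descent on the mean-square loss
    w = np.zeros(Phi.shape[1])
    for _ in range(steps):
        w -= lr * grad(w, Phi, y)
    return w
```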

In general, for a nonlinear model, we would use some gradient-based optimisation strategy to find the optimal set of weights which minimises the error. Here, due to the linearity of the model, the solution can be calculated directly in closed form. We won't go into the detail here, but simply give the solution. Let us first define the rectangular matrix often referred to as the design matrix,

\Phi = \begin{bmatrix} \phi_0({\bf x}_1) & \phi_1({\bf x}_1) & \ldots & \phi_K({\bf x}_1) \\ \phi_0({\bf x}_2) & \phi_1({\bf x}_2) & \ldots & \phi_K({\bf x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0({\bf x}_N) & \phi_1({\bf x}_N) & \ldots & \phi_K({\bf x}_N) \end{bmatrix}

Then the "optimal" weights for the model are given by

{\bf w} = \left( \Phi^T \Phi \right)^{-1} \Phi^T {\bf y}

We see that the calculation involves taking the inverse of a dense matrix, i.e. \left( \Phi^T \Phi \right)^{-1}. This will be computationally expensive if the number of features K and the number of data points N are large: the cost of forming \Phi^T\Phi grows quadratically with K and linearly with N, whilst the inversion itself grows cubically with K. For large numbers of features K (and therefore weights), we instead use optimisation schemes like gradient descent.
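Here is a minimal sketch of this closed-form fit, using a Gaussian basis on a 1D input (the data, centres and length scale are arbitrary choices). Note that in practice we solve the linear system rather than forming the inverse explicitly, which is numerically more stable.

```python
import numpy as np

def fit_linear_model(Phi, y):
    # Solve (Phi^T Phi) w = Phi^T y rather than inverting Phi^T Phi explicitly
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Example with Gaussian basis functions on a 1D input (arbitrary choices)
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

centres = np.linspace(0.0, 1.0, 8)
lam = 0.15
Phi = np.column_stack(
    [np.ones_like(x)] + [np.exp(-(x - mu) ** 2 / (2 * lam**2)) for mu in centres]
)

w = fit_linear_model(Phi, y)
print(w.shape)  # (9,) -- one bias weight plus one weight per Gaussian feature
```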

Role of the bias parameter w_0


We will see in a worked sample below how we can implement this and fit a linear model.

Before we do this, it is a good opportunity to look at the role of the bias parameter w_0, remembering that we conventionally choose \phi_0({\bf x}) \equiv 1. So let's look at what happens when we set the gradient of the error with respect to w_0 to zero:

\frac{\partial \mathcal L}{\partial w_0} = \frac{1}{N}\sum_{j=1}^N\Big[{\bf w}^T{\boldsymbol \phi}({\bf x}_j) - y_j\Big]\phi_0({\bf x}_j) = 0

So let us rewrite this a little, noting that \phi_0({\bf x}) \equiv 1, then

\frac{1}{N}\sum_{j=1}^N\left(w_0 + \sum_{k=1}^K w_k\phi_k({\bf x}_j) - y_j\right) = w_0 - \underbrace{\frac{1}{N}\sum_{j=1}^N y_j}_{\bar y} + \sum_{k=1}^K w_k\underbrace{\frac{1}{N}\sum_{j=1}^N\phi_k({\bf x}_j)}_{\bar{\phi}_k}

OK, so all of this equals zero, which means

w_0 = \bar{y} - \sum_{k=1}^K w_k\bar{\phi}_k

What we notice is that the parameter w_0 provides a constant shift which corrects for the difference between the average of the target variables over the training set and the weighted sum of the averages of the basis functions evaluated at the training inputs.
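We can check this relationship numerically with a small sketch. The data and Gaussian basis below are arbitrary illustrative choices; the point is simply that, at the least-squares optimum, the fitted bias weight satisfies w_0 = \bar{y} - \sum_k w_k \bar{\phi}_k.

```python
import numpy as np

# Numerical check of w_0 = y_bar - sum_k w_k * phi_bar_k for a fitted linear model.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 40)
y = 3.0 + np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal(x.size)

centres = np.linspace(0.0, 1.0, 6)
Phi = np.column_stack(
    [np.ones_like(x)] + [np.exp(-(x - mu) ** 2 / (2 * 0.2**2)) for mu in centres]
)

# Closed-form least-squares fit
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

y_bar = y.mean()
phi_bar = Phi[:, 1:].mean(axis=0)                 # average of each non-constant feature
print(np.isclose(w[0], y_bar - w[1:] @ phi_bar))  # True at the optimum
```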