
by Prof Tim Dodwell


General Linear Models

9. From Regression to Classification - Logistic Regression

In this explainer we look at how we can "generalise" linear models to work for classification problems. This is called logistic regression. At the end of this explainer you should:

  • Understand the key idea of logistic regression.
  • Understand its make-up (e.g. linear predictor + link function).
  • Understand how logistic models are trained (e.g. loss function + gradient descent).
  • Know how logistic regression can be extended to multiple classes.
  • Know the issues of overfitting, and the solutions to this (regularisation).

Logistic regression is a statistical model which is used for classification problems. Logistic regression estimates the probability $p$ that an event will occur. Hence the output of the model is between 0 and 1.

So we have a supervised learning problem, with our normal data set

$$\{({\bf x}_0, y_0), ({\bf x}_1, y_1), \ldots, ({\bf x}_N, y_N) \}.$$

Here ${\bf x}_i$ are our inputs and $y_i$ are our target variables, in this case a binary label of either $A$ or $B$.

Instead of fitting a straight line or hyper-plane (or any linear model), the logistic regression model uses the logistic function to squeeze the output of a linear equation between $0$ and $1$ so it represents a probability of giving label $A$.

The logistic function is defined as:

$$p_A({\bf x}) = \frac{1}{1 + \exp(-f({\bf x}))} \quad \text{where} \quad f({\bf x}) = {\bf w}^T{\boldsymbol \phi}.$$

It is probably easier if we sketch this function out, so let's do that.

[Figure: sketch of the logistic (sigmoid) function, squashing $f({\bf x})$ to a probability between 0 and 1]

The point here is that the input to the sigmoid function is the output of a linear model itself, $f({\bf x}) = {\bf w}^T{\boldsymbol \phi}$. As we will see, this gives great flexibility.
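As a minimal sketch of this idea (the function names and the basis matrix `Phi` are my own illustrative choices, not a prescribed implementation), applying the logistic function to a linear predictor looks like this in NumPy:

```python
import numpy as np

def sigmoid(f):
    """Logistic (sigmoid) function, mapping any real value to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-f))

def prob_label_A(w, Phi):
    """Probability of label A for each row of the basis matrix Phi.

    w   : weight vector, shape (M,)
    Phi : basis functions evaluated at each input, shape (N, M)
    """
    f = Phi @ w        # linear predictor f(x) = w^T phi(x) for every sample
    return sigmoid(f)  # p_A(x) = 1 / (1 + exp(-f(x)))
```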

To give an interpretation of this, we can rearrange the equation in terms of $f({\bf x})$, the linear model, so we get

$$f({\bf x}) = \ln\left( \frac{p_A}{1-p_A}\right) = \ln\left( \frac{p_A}{p_B}\right) = w_0 + w_1\phi_1({\bf x}) + \ldots + w_M\phi_M({\bf x}) = {\bf w}^T{\boldsymbol \phi}.$$

So $p_A/(1-p_A) = p_A/p_B$ is the 'odds', the ratio of the probability of label $A$ to that of not $A$ (i.e. $B$). So the interpretation of a logistic model is that it builds a linear model for the 'log odds'.
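For example (the numbers here are purely illustrative), if the model assigns $p_A = 0.8$ to an input, then $p_B = 0.2$, the odds are $0.8/0.2 = 4$, and the linear predictor at that input takes the value

$$f({\bf x}) = \ln\left(\frac{0.8}{0.2}\right) = \ln(4) \approx 1.39.$$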

Before moving on to how we train a logistic regression model, let us sketch some toy examples.

Here we look at a simple linear model $f({\bf x}) = w_0 + w_1 x$; this function is then pushed through the sigmoid function, generating the probability $p_A$ against input values $x$. In this example the decision boundary is at $x = 1/2$.

[Figure: a linear model $f(x) = w_0 + w_1 x$ pushed through the sigmoid, giving a decision boundary at $x = 1/2$]
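A small sketch of this toy example, with illustrative weights $w_0 = -1$, $w_1 = 2$ chosen so that $f(1/2) = 0$ (i.e. $p_A = 0.5$ at the decision boundary):

```python
import numpy as np

# Illustrative weights: f(x) = -1 + 2x, so the decision boundary is at x = 1/2.
w0, w1 = -1.0, 2.0

x = np.linspace(-2.0, 3.0, 11)
f = w0 + w1 * x                      # linear predictor
p_A = 1.0 / (1.0 + np.exp(-f))       # probability of label A

for xi, pi in zip(x, p_A):
    label = "A" if pi >= 0.5 else "B"
    print(f"x = {xi:5.2f}  p_A = {pi:.3f}  ->  {label}")
```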

As discussed, a linear model doesn't mean a linear function. So with logistic regression models there is the possibility of great flexibility. Here is an example where $f$ is a quadratic function, resulting in a more complex decision boundary.

[Figure: a quadratic $f$ pushed through the sigmoid, giving a more complex decision boundary]
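A quick sketch of this case (the weights are again illustrative): with basis ${\boldsymbol\phi}(x) = [1, x, x^2]$ and $w = [1, 0, -1]$ we get $f(x) = 1 - x^2$, so $p_A > 0.5$ only for $|x| < 1$, giving two decision boundaries at $x = \pm 1$.

```python
import numpy as np

# Quadratic model that is still linear in the weights:
# f(x) = w0 + w1*x + w2*x^2 with basis phi(x) = [1, x, x^2].
w = np.array([1.0, 0.0, -1.0])                       # f(x) = 1 - x^2

x = np.linspace(-2.0, 2.0, 9)
Phi = np.stack([np.ones_like(x), x, x**2], axis=1)   # basis matrix, shape (N, 3)
p_A = 1.0 / (1.0 + np.exp(-(Phi @ w)))

print(np.round(p_A, 3))   # p_A rises above 0.5 only between x = -1 and x = +1
```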

Before looking at training a logistic model, we note that logistic regression is a general extension of ordinary linear models. There are two ingredients:

  • A linear predictor, just as in a standard linear model: $f = {\bf w}^T{\boldsymbol \phi}$.
  • A link function, in our case the sigmoid function, but it could be more general. The link function provides the relationship between the linear predictor and the mean of the distribution being modelled.

Training a Logistic Model


For a logistic regression problem we can use the binary cross-entropy loss, which is given by

$$\mathcal L = - \frac{1}{N}\sum_{j=1}^N \Big[y_{j}\log(p_{j}) + (1 - y_{j})\log(1 - p_{j}) \Big].$$
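As a minimal sketch (the function name and the `eps` clipping guard are my own additions), this loss can be written directly in NumPy:

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy loss for labels y in {0, 1} and predicted
    probabilities p = p_A(x); eps guards against log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```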

The optimal weights can then be found via a gradient-based optimisation scheme (e.g. steepest descent, Newton or quasi-Newton).

The gradients can be calculated using the chain rule. We don't do the full calculation here, but

$$\frac{d \mathcal L}{d {\bf w}} = \frac{d \mathcal L}{d f} \cdot \frac{d f}{d {\bf w}} = \frac{d \mathcal L}{d f} \cdot{\boldsymbol\phi}({\bf x})$$

We note that the models are linear with respect to their weights, so if we differentiate $f$ with respect to ${\bf w}$ we simply get the basis functions ${\boldsymbol \phi}({\bf x})$.

This leaves us to calculate $\partial \mathcal L / \partial f$. The total loss is just the per-sample loss $\ell_j = -\big[y_j\log(p_j) + (1-y_j)\log(1-p_j)\big]$ summed over the samples. So we have

$$\begin{aligned} \frac{d \mathcal L}{d {\bf w}} &= \frac{1}{N}\sum_{j=1}^N \frac{\partial \ell_j}{\partial f}\cdot{\boldsymbol\phi}({\bf x}_j) \\ &= -\frac{1}{N}\sum_{j=1}^N \frac{\partial}{\partial f}\Big[ y_{j}\log(p_{j}) + (1 - y_{j})\log(1 - p_{j})\Big]\cdot{\boldsymbol\phi}({\bf x}_j) \\ &= \frac{1}{N}\sum_{j=1}^N \Big[p_j - y_j\Big] \cdot{\boldsymbol\phi}({\bf x}_j) \end{aligned}$$

The last step requires a bit of manipulation, but remember that

$$p_j = \frac{1}{1 + \exp(-f({\bf x}_j))}.$$
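Putting the pieces together, a minimal steepest-descent training loop might look like the sketch below. The learning rate, iteration count, helper names and synthetic data are illustrative choices, not a prescribed recipe; the key line is the gradient $\frac{1}{N}\sum_j (p_j - y_j)\,{\boldsymbol\phi}({\bf x}_j)$ derived above.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def fit_logistic(Phi, y, lr=0.1, n_iter=5000):
    """Fit logistic regression weights by steepest descent.

    Phi : basis matrix with rows phi(x_j), shape (N, M)
    y   : binary labels in {0, 1}, shape (N,)
    """
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        p = sigmoid(Phi @ w)            # p_j for every sample
        grad = Phi.T @ (p - y) / N      # (1/N) sum_j (p_j - y_j) phi(x_j)
        w -= lr * grad                  # steepest-descent update
    return w

# Tiny synthetic 1D example with basis phi(x) = [1, x] (data is illustrative).
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=200)
y = (x + rng.normal(0.0, 0.3, size=x.shape) > 0.5).astype(float)
Phi = np.stack([np.ones_like(x), x], axis=1)

w = fit_logistic(Phi, y)
print("fitted weights:", w)
print("estimated decision boundary:", -w[0] / w[1])   # should sit near x = 0.5
```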

Importance of Regularisation


Logistic regression is a convex optimisation problem (the likelihood function is concave), but it is known to have no finite solution when the model can fully separate the data: the loss function can only reach its lowest value asymptotically, as the weights tend to $\pm \infty$.

Data is more likely to be fully separable when we have limited data relative to the flexibility of the model, which brings us back to the classic trade-off between over- and underfitting.

When the data is separable and the linear model is sufficiently expressive, this has the effect of tightening the decision boundaries around the data points, asymptotically overfitting the training set.

[Figure: decision boundaries tightening around the training points as the weights grow]

Without regularisation, the asymptotic nature of logistic regression would keep driving the loss towards 0 in high dimensions. There are therefore two well-used strategies:

  • $L^2$ (or Ridge) Regularisation, where an additional term $\lambda \|{\bf w}\|^2$ is added to the loss.

  • Early Stopping.

We deal with regularisation as a separate topic. For now, note that sklearn automatically applies ridge regularisation ('l2') by default with $\lambda$ set to 1. So you now know what this does, and understand that the regularisation parameter $\lambda$ is actually a hyperparameter which you should also optimise over during training.

[Figure: snapshot of the sklearn LogisticRegression class documentation]

Here is a snapshot from the class documentation for logistic regression. Whilst there is no need to write your own code for doing it, since sklearn's implementation is good, it is important that you understand the meaning of the default assumptions. Fitting a good logistic model will often require tuning of the regularisation parameter. Here you will see they allow a user to set $C = 1/\lambda$, the inverse of the regularisation strength.
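As a short sketch of how this looks in practice (the synthetic data and the grid of `C` values are placeholders chosen for illustration), you might tune `C` with cross-validation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Placeholder data purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# sklearn's default is penalty='l2' with C=1.0, where C = 1/lambda.
# Treat C as a hyperparameter and search over it with cross-validation.
search = GridSearchCV(
    LogisticRegression(penalty="l2", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,
)
search.fit(X, y)
print("best C:", search.best_params_["C"])
```

Remember that `C` is the inverse of $\lambda$, so smaller values of `C` mean stronger regularisation.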

Multi-class Classification


So far we have considered binary classification, where there is a choice of only two classes, labels or outcomes. In general, classification can map to any number of classes, which is referred to as multi-class classification.

Multi-class classification can be achieved via a simple extension of binary classification, described above, by following an approach called one-vs-all (or one-vs-rest).

[Figure: one-vs-all classification, with one binary classifier per class]

Suppose we now have $M$ classes. A logistic regression classifier is built for each class, giving a probability $p_i$ for each $i = 1$ to $M$.

We build a model which predicts the probability of each class separately, i.e. a binary classifier asking: is it class $i$ or is it not class $i$?

The classifier with the highest probability wins:

$$\text{argmax}_i \Big[p_i({\bf x})\Big]$$
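A minimal one-vs-all sketch using sklearn (the synthetic data is a placeholder, and the explicit loop is only there to show the mechanics):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder multi-class data purely for illustration.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

# One binary "class i vs the rest" logistic regression per class.
classes = np.unique(y)
models = {c: LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int))
          for c in classes}

# Each model reports p_i(x); the class with the highest probability wins.
probs = np.column_stack([models[c].predict_proba(X)[:, 1] for c in classes])
y_pred = classes[np.argmax(probs, axis=1)]

print("training accuracy:", np.mean(y_pred == y))
```

In practice sklearn can do this for you (for example via `OneVsRestClassifier`, or `LogisticRegression`'s own multi-class support); the loop above just makes the one-vs-all mechanics explicit.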

Conclusion


Logistic regression is an extension of linear models to classification problems. It is part of a broader class of models which are called "Generalised Linear Models". These are the composition of two ingredients:

  • a linear predictor (here defined as $f = {\bf w}^T{\boldsymbol \phi}$)
  • a link function - in this case the logit or sigmoid function.

This second function plays the role of squashing the output between $0$ and $1$, transforming the output of the model into a probability, which can be used for classification.

Regularisation is an important part of classification models, and when you use packages there are often default parameters chosen for you which play a central role in the quality of the results.

Finally, binary classification (two classes / labels) can be extended easily to multi-class classification using the so-called "one-vs-all" strategy. This is nothing more than building multiple binary classifiers, one for each class.