Machine Learning Workflow: Loss Functions

by Prof Tim Dodwell


In this explainer we define what a loss function is and how it forms a central part of training a supervised learning algorithm. In particular, we will cover:

  • The key concept and properties of a loss function.
  • Examples of common loss functions for regression and classification tasks.

So, looking at a supervised learning algorithm, we have our usual labelled dataset

X := \{({\bf x}_1, y_1), ({\bf x}_2, y_2), \ldots, ({\bf x}_N, y_N)\}

The aim of the game is to use an algorithm to approximate the mapping from {\bf x} to y, given this set of examples (the training set). The hope is that the predictions of the approximation then generalise to unseen examples.

So let us say our trained supervised learning algorithm makes the predictions \tilde y({\bf x}); then, when assessing the quality of the model, we need some measure of how well it fits known examples.

A loss function is a mathematical function used to measure how well a machine learning model is performing on a given task. It calculates the difference between the predicted output of the model and the actual output (i.e., the target variable) for a particular input data point.

There are various choices of loss function, and the right one depends on the task at hand, but there are some broad properties we require of a loss function:

  1. Non-negativity [Essential]: The loss function should always output non-negative values. This is because a "perfect" model is one that recovers the data exactly, which has a loss of zero; nothing can improve on this as a prediction.

  2. Differentiability [Desired]: Since many algorithms are trained using gradient-based optimisation methods (the most common being stochastic gradient descent), differentiability ensures that these optimisation strategies work effectively.

Loss Functions for Regression Tasks


First we consider the most common loss functions for regression-based tasks, i.e. tasks for which the output or target variable is continuous.

Mean Squared Error (MSE) Loss

This is the most commonly used loss function for regression. It measures the average squared difference between the predicted and actual values. The formula for MSE loss is:

\mathcal L = \frac{1}{N} \sum_{j=1}^N \Big(y_j - \tilde y({\bf x}_{j})\Big)^2

where y_j are the true outputs and \tilde y({\bf x}_j) the predicted outputs.
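As a concrete illustration, here is a minimal NumPy sketch of the MSE loss; the names mse_loss, y_true and y_pred are illustrative choices for this example rather than taken from any particular library.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error between true outputs y_true and predictions y_pred."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

# Example: predictions close to the targets give a small loss.
print(mse_loss([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # 0.02
```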

Mean Absolute Error (MAE) Loss

This loss function measures the absolute difference between the predicted and actual values. The formula for MAE loss is:

\mathcal L = \frac{1}{N} \sum_{j=1}^N \big|y_j - \tilde y({\bf x}_{j})\big|

where y_j is the true output, \tilde y({\bf x}_j) the predicted output, and N the number of samples.
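A matching sketch for the MAE loss (again, the function and argument names are just illustrative). Comparing it with the MSE sketch above on data containing an outlier shows why MAE is often described as more robust.

```python
import numpy as np

def mae_loss(y_true, y_pred):
    """Mean absolute error between true outputs y_true and predictions y_pred."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred))

# A single large outlier inflates MSE far more than MAE.
print(mae_loss([1.0, 2.0, 10.0], [1.0, 2.0, 3.0]))  # ~2.33
print(np.mean((np.array([1.0, 2.0, 10.0]) - np.array([1.0, 2.0, 3.0])) ** 2))  # ~16.33
```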

Log-Cosh Loss

This loss function is a smooth approximation of the MAE loss that is also less sensitive to outliers than MSE loss. The formula for Log-Cosh loss is:

\mathcal L = \frac{1}{N} \sum_{j=1}^N \log \Big(\cosh\Big[y_j - \tilde y({\bf x}_j)\Big]\Big)

where \cosh(\cdot) is the hyperbolic cosine function.
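A short sketch of the log-cosh loss, under the same illustrative naming assumptions, makes its behaviour visible: roughly quadratic for small errors and roughly linear for large ones.

```python
import numpy as np

def log_cosh_loss(y_true, y_pred):
    """Log-cosh loss: behaves like 0.5 * error**2 for small errors
    and like |error| - log(2) for large errors."""
    error = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.log(np.cosh(error)))

print(log_cosh_loss([0.0], [0.1]))   # ~0.005, close to 0.5 * 0.1**2
print(log_cosh_loss([0.0], [10.0]))  # ~9.31, close to 10 - log(2)
```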

Loss Functions for Classification Tasks


Second we consider the most common loss functions for classification-based tasks, i.e. tasks for which the output or target variable is a discrete label.

Categorical Cross-Entropy Loss

This is a loss function used for multi-class classification tasks where the output is a probability distribution over multiple classes. It measures the difference between the predicted probability distribution and the true class labels.

The formula for categorical cross-entropy loss for a C-class classification problem over N samples is

\mathcal L = \sum_{j=1}^N\left( - \sum_{i=1}^C p_{ij}\log(\tilde{p}_{ij})\right).

Here p_{ij} is the true probability for class i of sample j, i.e.

p_{ij} = \begin{cases} 1 & \text{if } y_j = i \\ 0 & \text{otherwise} \end{cases}

Whereas \tilde{p}_{ij} is the predicted probability of classifying the sample with input {\bf x}_j into class i, which comes out of our model.

So, looking at the simplest case of binary classification, i.e. two labels, either 0 or 1: if the true probability of class 0 is p_{0j}, then the probability of class 1 is

p_{1j} = 1 - p_{0j},

hence the loss is

\mathcal L = - \sum_{j=1}^N \Big[p_{0j}\log(\tilde{p}_{0j}) + (1 - p_{0j})\log(1 - \tilde{p}_{0j}) \Big]
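The sketch below evaluates the categorical cross-entropy for one-hot true labels and predicted class probabilities. The names (categorical_cross_entropy, p_true, p_pred) and the small epsilon clip to avoid log(0) are implementation choices of this example, not part of the formula above.

```python
import numpy as np

def categorical_cross_entropy(p_true, p_pred, eps=1e-12):
    """Cross-entropy for one-hot labels p_true and predicted
    probabilities p_pred, both arrays of shape (N, C)."""
    p_pred = np.clip(np.asarray(p_pred), eps, 1.0)  # avoid log(0)
    return -np.sum(np.asarray(p_true) * np.log(p_pred))

# Two samples, three classes: one confident correct prediction,
# one uncertain prediction on the true class.
p_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
p_pred = np.array([[0.9, 0.05, 0.05],
                   [0.4, 0.4,  0.2]])
print(categorical_cross_entropy(p_true, p_pred))  # ~1.02
```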

Hinge Loss


This is a loss function used for binary classification tasks, where the output is a scalar rather than a probability. It measures the difference between the predicted scalar value and the true binary label, and penalises the model more heavily for incorrect predictions that are further from the decision boundary.

For this binary classification we have to change our labels to either -1 or 1. This is an important step; otherwise you will get odd results!

The formula for the hinge loss is

\mathcal L = \sum_{j=1}^N\Big[ \max \Big(0, 1 - y_j \cdot \tilde y({\bf x}_j)\Big)\Big]

So let's unpick this a little.

Suppose that the true label is y_j = -1, and the prediction from the model is \tilde y = -1 + \alpha, where \alpha is between 0 and 2. Then the loss term is \max(0, 1 - (-1)(-1 + \alpha)) = \max(0, \alpha) = \alpha. So the loss increases, up to a maximum of 2, the further the prediction is from the correct decision.

The same happens when the true label is y_j = 1. If the model predicts \tilde y_j = 1 - \alpha, again with \alpha between 0 and 2, then the loss term is \max(0, 1 - (1 - \alpha)) = \alpha. Again, the contribution to the overall loss increases up to a maximum of 2.
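The following sketch reproduces this worked example numerically; the function name hinge_loss is an assumption of this example, and it relies on the {-1, +1} label convention described above.

```python
import numpy as np

def hinge_loss(y_true, y_scores):
    """Hinge loss for labels in {-1, +1} and raw (scalar) model scores."""
    y_true, y_scores = np.asarray(y_true), np.asarray(y_scores)
    return np.sum(np.maximum(0.0, 1.0 - y_true * y_scores))

# True label -1, prediction -1 + alpha: the loss contribution equals alpha.
for alpha in [0.0, 0.5, 1.0, 2.0]:
    print(alpha, hinge_loss([-1.0], [-1.0 + alpha]))
```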

Hinge loss is commonly used in support vector machines (SVMs), a popular classification algorithm. SVMs aim to find the hyperplane that maximizes the margin between the two classes. Hinge loss is used as the loss function in SVMs because it encourages the model to find the decision boundary with the maximum margin.

Concluding Remarks


A loss function is a measure of the quality of a predictive model over either a training or test set. There are a number of choices, which depend on the task at hand, for example whether the task is a regression or classification problem. Smoothness of the loss function, i.e. no jumps or sharp corners, is preferred, since most ML algorithms are trained with gradient-based methods.