Machine Learning Workflow: Loss Functions

by Prof Tim Dodwell


In this explainer we define what a loss function is and how it forms a central part of training a supervised learning algorithm. In particular, we will cover:

  • The key concept and properties of a loss function.
  • Examples of common loss functions for regression and classification tasks.

So, looking at a supervised learning algorithm, we have our usual labelled dataset

X := \{({\bf x}_1, y_1), ({\bf x}_2, y_2), \ldots, ({\bf x}_N, y_N)\}

The aim of the game is to use an algorithm to approximate the mapping from {\bf x} to y, given this set of examples (the training set). The hope is that the predictions of the approximation then generalise to unseen examples.

So let us say our trained supervised learning algorithm makes the predictions \tilde y({\bf x}); then, when assessing the quality of the model, we need some measure of how well it fits known examples.

A loss function is a mathematical function used to measure how well a machine learning model is performing on a given task. It calculates the difference between the predicted output of the model and the actual output (i.e., the target variable) for a particular input data point.

There are various choices of loss function, and the right one depends on the task at hand, but there are some broad properties we require of a loss function:

  1. Non-negativity [Essential]: The loss function should always output non-negative values. This is because a "perfect" model is one that recovers the data exactly, which has a loss of zero; nothing can improve on this as a prediction.

  2. Differentiability [Desired]: Since many algorithms are trained using gradient-based optimisation methods (the most common being stochastic gradient descent), differentiability ensures that these optimisation strategies work effectively.

Loss Functions for Regression Tasks


First we consider the most common loss functions for regression-based tasks, i.e. tasks for which the output or target variable is continuous.

Mean Squared Error (MSE) Loss

This is the most commonly used loss function for regression. It measures the average squared difference between the predicted and actual values. The formula for MSE loss is:

\mathcal L = \frac{1}{N} \sum_{j=1}^N \Big(y_j - \tilde y({\bf x}_{j})\Big)^2

where y_j are the true outputs and \tilde y({\bf x}_j) the predicted outputs.
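As a concrete illustration, here is a minimal NumPy sketch of the MSE loss; the names mse_loss, y_true and y_pred are illustrative choices for this example rather than taken from any particular library.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error between true outputs y_true and predictions y_pred."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

# Example: predictions close to the targets give a small loss.
print(mse_loss([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # 0.02
```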

Mean Absolute Error (MAE) Loss

This loss function measures the absolute difference between the predicted and actual values. The formula for MAE loss is:

\mathcal L = \frac{1}{N} \sum_{j=1}^N \big|y_j - \tilde y({\bf x}_{j})\big|

where y_j is the true output, \tilde y({\bf x}_j) the predicted output, and N the number of samples.
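A matching sketch for the MAE loss (again, the function and argument names are just illustrative). Comparing it with the MSE sketch above on data containing an outlier shows why MAE is often described as more robust.

```python
import numpy as np

def mae_loss(y_true, y_pred):
    """Mean absolute error between true outputs y_true and predictions y_pred."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred))

# A single large outlier inflates MSE far more than MAE.
print(mae_loss([1.0, 2.0, 10.0], [1.0, 2.0, 3.0]))  # ~2.33
print(np.mean((np.array([1.0, 2.0, 10.0]) - np.array([1.0, 2.0, 3.0])) ** 2))  # ~16.33
```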

Log-Cosh Loss

This loss function is a smooth approximation of the MAE loss that is also less sensitive to outliers than MSE loss. The formula for Log-Cosh loss is:

\mathcal L = \frac{1}{N} \sum_{j=1}^N \log \Big(\cosh\Big[y_j - \tilde y({\bf x}_j)\Big]\Big)

where \cosh(\cdot) is the hyperbolic cosine function.
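A short sketch of the log-cosh loss, under the same illustrative naming assumptions, makes its behaviour visible: roughly quadratic for small errors and roughly linear for large ones.

```python
import numpy as np

def log_cosh_loss(y_true, y_pred):
    """Log-cosh loss: behaves like 0.5 * error**2 for small errors
    and like |error| - log(2) for large errors."""
    error = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.log(np.cosh(error)))

print(log_cosh_loss([0.0], [0.1]))   # ~0.005, close to 0.5 * 0.1**2
print(log_cosh_loss([0.0], [10.0]))  # ~9.31, close to 10 - log(2)
```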

Loss Functions for Classification Tasks


Second we consider the most common loss functions for classification-based tasks, i.e. tasks for which the output or target variable is a discrete label.

Categorical Cross-Entropy Loss

This is a loss function used for multi-class classification tasks where the output is a probability distribution over multiple classes. It measures the difference between the predicted probability distribution and the true class labels.

The formula for categorical cross-entropy loss for a C-class classification problem over N samples is

\mathcal L = \sum_{j=1}^N\left( - \sum_{i=1}^C p_{ij}\log(\tilde{p}_{ij})\right).

Here p_{ij} is the true probability for class i of sample j, i.e.

p_{ij} = \begin{cases} 1 & \text{if } y_j = i \\ 0 & \text{otherwise} \end{cases}

Whereas \tilde{p}_{ij} is the predicted probability of classifying the sample with input {\bf x}_j into class i, which comes out of our model.

So, looking at the simplest case of binary classification, i.e. two labels, either 0 or 1: if the true probability of class 0 is p_{0j}, then the probability of class 1 is

p_{1j} = 1 - p_{0j},

hence the loss is

\mathcal L = - \sum_{j=1}^N \Big[p_{0j}\log(\tilde{p}_{0j}) + (1 - p_{0j})\log(1 - \tilde{p}_{0j}) \Big]
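The sketch below evaluates the categorical cross-entropy for one-hot true labels and predicted class probabilities. The names (categorical_cross_entropy, p_true, p_pred) and the small epsilon clip to avoid log(0) are implementation choices of this example, not part of the formula above.

```python
import numpy as np

def categorical_cross_entropy(p_true, p_pred, eps=1e-12):
    """Cross-entropy for one-hot labels p_true and predicted
    probabilities p_pred, both arrays of shape (N, C)."""
    p_pred = np.clip(np.asarray(p_pred), eps, 1.0)  # avoid log(0)
    return -np.sum(np.asarray(p_true) * np.log(p_pred))

# Two samples, three classes: one confident correct prediction,
# one uncertain prediction on the true class.
p_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
p_pred = np.array([[0.9, 0.05, 0.05],
                   [0.4, 0.4,  0.2]])
print(categorical_cross_entropy(p_true, p_pred))  # ~1.02
```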

Hinge Loss


This is a loss function used for binary classification tasks, where the output is a scalar rather than a probability. It measures the difference between the predicted scalar value and the true binary label, and penalises the model more heavily for incorrect predictions that are further from the decision boundary.

For this binary classification we have to change our labels to either -1 or 1. This is an important step; otherwise you will get odd results!

The formula for the hinge loss is

\mathcal L = \sum_{j=1}^N\Big[ \max \Big(0, 1 - y_j \cdot \tilde y({\bf x}_j)\Big)\Big]

So let's unpick this a little.

Suppose that the true label is y_j = -1, and the prediction from the model is \tilde y = -1 + \alpha, where \alpha is between 0 and 2. Then the loss term is \max(0, 1 - (-1)(-1 + \alpha)) = \max(0, \alpha) = \alpha. So the loss increases, up to a maximum of 2, the further the prediction is from the correct decision.

The same happens when the true label is y_j = 1. If the model predicts \tilde y_j = 1 - \alpha, again with \alpha between 0 and 2, then the loss term is \max(0, 1 - (1 - \alpha)) = \alpha. Again, the contribution to the overall loss increases up to a maximum of 2.
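The following sketch reproduces this worked example numerically; the function name hinge_loss is an assumption of this example, and it relies on the {-1, +1} label convention described above.

```python
import numpy as np

def hinge_loss(y_true, y_scores):
    """Hinge loss for labels in {-1, +1} and raw (scalar) model scores."""
    y_true, y_scores = np.asarray(y_true), np.asarray(y_scores)
    return np.sum(np.maximum(0.0, 1.0 - y_true * y_scores))

# True label -1, prediction -1 + alpha: the loss contribution equals alpha.
for alpha in [0.0, 0.5, 1.0, 2.0]:
    print(alpha, hinge_loss([-1.0], [-1.0 + alpha]))
```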

Hinge loss is commonly used in support vector machines (SVMs), a popular classification algorithm. SVMs aim to find the hyperplane that maximizes the margin between the two classes. Hinge loss is used as the loss function in SVMs because it encourages the model to find the decision boundary with the maximum margin.

Concluding Remarks


A loss function is a measure of the quality of a predictive model over either a training or test set. There are a number of choices, which depend on the task at hand, for example whether the task is a regression or classification problem. Smoothness of the loss function, i.e. no jumps or sharp corners, is preferred, since most ML algorithms are trained with gradient-based methods.