Machine Learning Workflow: Loss Functions
by Prof Tim Dodwell
In this explainer we define what a loss function is and why it is a central part of training a supervised learning algorithm. In particular, we will cover:
- The key concept and properties of a loss function
- Examples of common loss functions for regression and classification tasks.
So looking at a supervised learning algorithm, we have our usual labelled dataset

$$\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}.$$

The aim of the game is to use an algorithm $f$ to approximate the mapping from $\mathbf{x}$ to $y$, given this set of examples (the training set). The hope is that the predictions of the approximation then generalise to unseen examples.

So let us say our trained supervised learning algorithm makes the predictions $\hat{y}_i = f(\mathbf{x}_i)$. When assessing the quality of the model we need some measure of how well it fits known examples.
A loss function is a mathematical function used to measure how well a machine learning model is performing on a given task. It calculates the difference between the predicted output of the model and the actual output (i.e., the target variable) for a particular input data point.
There are various choices of loss function, each of which depends on the task at hand, but there are some broad properties we require of a loss function:

- Non-negativity [Essential]: The loss function should always output non-negative values. This is because a "perfect" prediction is one that recovers the data exactly, which has a loss of zero; nothing can improve on this as a prediction.

- Differentiable [Desired]: Since many algorithms are trained using gradient-based optimisation methods (the most common being stochastic gradient descent), differentiability means that these optimisation strategies work effectively.
Loss functions for Regression Tasks
First we consider the most common loss functions for regression-based tasks, i.e. tasks for which the output or target variable is continuous.
Mean Squared Error (MSE) Loss
This is the most commonly used loss function for regression. It measures the average squared difference between the predicted and actual values. The formula for MSE loss is:
$$\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - f(\mathbf{x}_i)\right)^2,$$

for true outputs $y_i$ and predicted outputs $f(\mathbf{x}_i)$.
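As a quick illustration, here is a minimal NumPy sketch of the MSE loss (the function name `mse_loss` and the example arrays are just illustrative):

```python
import numpy as np

def mse_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error: the average of the squared residuals."""
    return float(np.mean((y_true - y_pred) ** 2))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.5])
print(mse_loss(y_true, y_pred))  # (0.01 + 0.01 + 0.25) / 3 = 0.09
```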
Mean Absolute Error (MAE) Loss
This loss function measures the absolute difference between the predicted and actual values. The formula for MAE loss is:
$$\mathcal{L}_{\text{MAE}} = \frac{1}{N} \sum_{i=1}^{N} \left|y_i - f(\mathbf{x}_i)\right|,$$

where $y_i$ is the true output, $f(\mathbf{x}_i)$ is the predicted output, and $N$ is the number of samples.
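The same idea as a minimal NumPy sketch (again, names are illustrative); note the absolute rather than squared residuals:

```python
import numpy as np

def mae_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error: the average of the absolute residuals."""
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.5])
print(mae_loss(y_true, y_pred))  # (0.1 + 0.1 + 0.5) / 3 ≈ 0.2333
```

Comparing with the MSE example above, the large residual ($0.5$) is penalised less heavily here, which is why MAE is often preferred when the data contains outliers.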
Log-Cosh Loss
This loss function is a smooth approximation of the MAE loss that is also less sensitive to outliers than MSE loss. The formula for Log-Cosh loss is:
$$\mathcal{L}_{\text{log-cosh}} = \sum_{i=1}^{N} \log\left(\cosh\left(f(\mathbf{x}_i) - y_i\right)\right),$$

where $\cosh$ is the hyperbolic cosine function.
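A minimal NumPy sketch of log-cosh (the helper name is illustrative); the docstring notes its behaviour at small and large residuals:

```python
import numpy as np

def log_cosh_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Log-cosh loss: approximately quadratic (like MSE) for small
    residuals, and approximately linear (like MAE) for large ones."""
    return float(np.sum(np.log(np.cosh(y_pred - y_true))))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.5])
print(log_cosh_loss(y_true, y_pred))  # ≈ 0.130
```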
Loss functions for Classification Tasks
Second we consider the most common loss functions for classification-based tasks, i.e. tasks for which the output or target variable is a discrete label.
Categorical Cross-Entropy Loss
This is a loss function used for multi-class classification tasks where the output is a probability distribution over multiple classes. It measures the difference between the predicted probability distribution and the true class labels.
The formula for categorical cross-entropy loss for a $K$-class classification problem over $N$ samples is

$$\mathcal{L}_{\text{CCE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}.$$

Here $y_{ik}$ is the true probability for class $k$ of sample $i$, i.e.

$$y_{ik} = \begin{cases} 1 & \text{if sample } i \text{ belongs to class } k, \\ 0 & \text{otherwise,} \end{cases}$$

whereas $\hat{y}_{ik}$ is the predicted probability of classifying sample $i$ with input $\mathbf{x}_i$ into class $k$, which comes out of our model.
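To make the double sum concrete, here is a minimal NumPy sketch of categorical cross-entropy for one-hot labels (the function name and the small epsilon clip, which guards against $\log 0$, are illustrative choices):

```python
import numpy as np

def categorical_cross_entropy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Categorical cross-entropy.

    y_true: (N, K) one-hot true labels.
    y_pred: (N, K) predicted class probabilities (each row sums to 1).
    """
    eps = 1e-12  # clip to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

# Two samples, three classes.
y_true = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
print(categorical_cross_entropy(y_true, y_pred))  # -(log 0.7 + log 0.6)/2 ≈ 0.434
```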
So looking at the simplest case of binary classification, i.e. two labels, either $0$ or $1$: if the true probability of classifying sample $i$ as class $1$ is $y_i$, then the probability of classifying it as a $0$ is $1 - y_i$, hence the loss is

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right].$$
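A minimal NumPy sketch of this binary case (names and the epsilon clip are illustrative):

```python
import numpy as np

def binary_cross_entropy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Binary cross-entropy for labels in {0, 1}, where y_pred
    holds the predicted probabilities of class 1."""
    eps = 1e-12  # clip to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1.0 - y_true) * np.log(1.0 - y_pred)))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(y_true, y_pred))  # ≈ 0.280
```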
Hinge Loss
This is a function used for binary classification tasks, where the output is a scalar rather than a probability. It measures the difference between the predicted scalar value and the true binary label, and penalizes the model more heavily for incorrect predictions that are farther from the decision boundary.
In this binary classification setting, we have to change our labels to either $-1$ or $+1$. This is an important step, otherwise you will get odd results!

The formula for the hinge loss is

$$\mathcal{L}_{\text{hinge}} = \frac{1}{N} \sum_{i=1}^{N} \max\left(0,\, 1 - y_i f(\mathbf{x}_i)\right).$$
So let's unpick this a little.
Suppose that the true label is $y_i = 1$. If the prediction from the model is $f(\mathbf{x}_i) = \alpha$, where $\alpha$ is between $-1$ and $1$, then we see that the loss is $\max(0, 1 - \alpha)$. So the loss increases to a maximum of $2$ the further the prediction is from the correct decision.

The same happens when the true label is $y_i = -1$. So let's assume the model again predicts $f(\mathbf{x}_i) = \alpha$, where $\alpha$ is between $-1$ and $1$; this time the loss is $\max(0, 1 + \alpha)$. Again, in this case the contribution to the overall loss increases to a maximum of $2$.
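Putting the two cases together, here is a minimal NumPy sketch of the hinge loss (illustrative names; note the labels are in $\{-1, +1\}$, as stressed above):

```python
import numpy as np

def hinge_loss(y_true: np.ndarray, scores: np.ndarray) -> float:
    """Hinge loss: y_true must contain labels in {-1, +1};
    scores are the raw (unthresholded) model outputs."""
    return float(np.mean(np.maximum(0.0, 1.0 - y_true * scores)))

y_true = np.array([1.0, -1.0, 1.0])
scores = np.array([0.8, -0.5, -0.3])
print(hinge_loss(y_true, scores))  # (0.2 + 0.5 + 1.3) / 3 ≈ 0.667
```

Note how the third sample, which sits on the wrong side of the decision boundary, contributes the largest term.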
Hinge loss is commonly used in support vector machines (SVMs), a popular classification algorithm. SVMs aim to find the hyperplane that maximizes the margin between the two classes. Hinge loss is used as the loss function in SVMs because it encourages the model to find the decision boundary with the maximum margin.
Concluding Remarks
A loss function is a measure of the quality of a predictive model over either a training or test set. There are a number of choices, which depend on the task at hand, for example whether the task is a regression or classification problem. Smoothness of the loss function, i.e. no jumps or sharp corners, is preferred, since most ML algorithms are trained with gradient-based methods.