by Dr Andy Corbett

Lesson

7. Activations and Loss Functions

In this video you will...
  • ✅ See a variety of different activation functions and learn how and when to use them.
  • ✅ Observe different loss functions to measure the success of model performance.
  • ✅ Discern between different metrics for different types of task.

Design choices: what to choose and why


As we saw when designing neural networks, there were many 'hyperparameter' choices to make. In this video, we shall take a short tour of such choices for loss functions and activation functions.

Picking your loss functions


Loss functions assess how close our predictions are to the ground-truth values. They need to be chosen bespoke to the specific problem for an effective comparison. For instance, if we have a binary problem, looking at the data on a continuous scale won't be much use. So we begin by splitting the problem into two classes.

Fig. 1. Do we have a classification or a regression problem?

In this video we give a visual description of the different loss functions that can be applied to these problems.

Regression Problems These are problems whose target is a continuous variable. There are two popular choices of loss function to choose between, serving different goals (see the code sketch after this list):

  • Mean-Squared Error: $L_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}(y_{i}^{\text{true}} - y_{i}^{\text{pred}})^{2}$, where $N$ is the number of samples. We use this function to penalise predictions which deviate far from the ground truth.
  • Mean Absolute Error: $L_{\text{MAE}} = \frac{1}{N}\sum_{i=1}^{N}\lvert y_{i}^{\text{true}} - y_{i}^{\text{pred}}\rvert$. This function is chosen to penalise noisy outputs which oscillate around the true values.
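
As a concrete illustration, here is a minimal sketch of these two regression losses. It assumes PyTorch (not prescribed by the lesson), and the tensor values are arbitrary toy numbers:

```python
import torch
import torch.nn as nn

# Hypothetical predictions and ground-truth targets, purely for illustration.
y_pred = torch.tensor([2.5, 0.0, 2.1, 7.8])
y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])

mse = nn.MSELoss()  # mean-squared error, L_MSE
mae = nn.L1Loss()   # mean absolute error, L_MAE

print("MSE:", mse(y_pred, y_true).item())
print("MAE:", mae(y_pred, y_true).item())
```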

Classification Problems These are problems where the target lives in one of a number of distinct classes. We distinguish between the two-class and many-class cases (a code sketch follows the list):

  • Binary Cross-Entropy: $L_{\text{BCE}} = -\frac{1}{N} \sum_{i=1}^{N}\left[y_{i}^{\text{true}}\cdot\log(y_{i}^{\text{pred}}) + (1 - y_{i}^{\text{true}})\cdot\log(1 - y_{i}^{\text{pred}})\right]$, where $y_{i}^{\text{pred}}$ is interpreted as the probability that the sample is in the positive class, i.e. the activated output. The leading terms, which depend on $y_{i}^{\text{true}}$, act as an indicator determining which term contributes for each class. The second terms are then the negative log-likelihoods predicted by the model outputs; in fact, we shall see that the negative log-likelihood loss and the BCE loss can be used interchangeably.
  • Categorical cross entropy loss is an extension of the BCE loss for multi-class problems. It is similarly based on the negative log-likelihood loss.
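
Here is a minimal sketch of both classification losses, again assuming PyTorch; the probabilities, logits and labels below are hypothetical examples:

```python
import torch
import torch.nn as nn

# Binary case: y_pred holds probabilities of the positive class, y_true is in {0, 1}.
y_pred_bin = torch.tensor([0.9, 0.2, 0.7])
y_true_bin = torch.tensor([1.0, 0.0, 1.0])
bce = nn.BCELoss()
print("BCE:", bce(y_pred_bin, y_true_bin).item())

# Multi-class case: CrossEntropyLoss takes raw scores (logits), one row per sample,
# plus integer class labels, and applies log-softmax internally.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5, 0.3]])
labels = torch.tensor([0, 1])
cce = nn.CrossEntropyLoss()
print("Categorical CE:", cce(logits, labels).item())
```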

Choosing between activations


In the video we shall give a whistle-stop tour of the following activations, each serving a different purpose in neural network architectures (a short code sketch follows the list).

  1. Sigmoid (logistic). $\sigma(x) = \frac{1}{1 + e^{-x}}$. This function is commonly used on output layers when predicting probabilities (values within $[0, 1]$).
  2. Hyperbolic tangent. $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. This function has similar properties to $\sigma$ but sits in the range $[-1, 1]$, symmetric about zero. For this reason it is sometimes used in intermediate layers where the nodes need to be regularised but remain unbiased, with mean zero. It can also be used on output layers to indicate whether an item is for (+1), against (-1) or neutral (0) towards the outcome.
  3. Rectified Linear Unit, ReLU. $\text{ReLU}(x) = \max(x, 0)$. This is the most common activation to use on hidden nodes, as it is cheap to evaluate and to differentiate. However, since its gradient is zero for negative inputs, regions of parameter space can be hard to explore.
  4. Leaky ReLU. $\text{LeakyReLU}(x) = \max(x, ax)$, where $a$ is a small positive constant. This remedies the exploration bottleneck of the ReLU by allowing a small slope into negative space.
  5. Softplus. $\text{Softplus}(x) = \frac{1}{\beta}\log(1 + e^{\beta x})$. Keep this in your pocket if you want smooth derivatives: Softplus approximates ReLU as $\beta \rightarrow \infty$.
  6. Max (for multi-objective outputs) returns the maximum of the outputs.
  7. Softmax (for multi-objective outputs) is a smooth way to return the max whilst normalising against the other components.
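
To make the comparison concrete, here is a minimal sketch evaluating each of these activations on a few sample inputs, again assuming PyTorch; the input values are arbitrary illustrations:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=7)  # a spread of negative and positive inputs

print("sigmoid:   ", torch.sigmoid(x))                       # squashes into (0, 1)
print("tanh:      ", torch.tanh(x))                          # squashes into (-1, 1), zero-centred
print("ReLU:      ", F.relu(x))                              # zero for negative inputs
print("LeakyReLU: ", F.leaky_relu(x, negative_slope=0.01))   # small slope a = 0.01 for x < 0
print("Softplus:  ", F.softplus(x, beta=1.0))                # smooth approximation of ReLU

# Softmax normalises a vector of scores into probabilities that sum to one,
# giving most weight to the largest component.
scores = torch.tensor([2.0, 1.0, 0.1])
print("softmax:   ", F.softmax(scores, dim=0))
```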