by Dr Andy Corbett

Lesson

4. Bonus Example on Backpropagation

In this video you will...
  • ✅ Take a short mathematical tour of the ins and outs of backpropagation.
  • ✅ Have a go at the chain rule.
  • ✅ Compute gradients.
  • ✅ Update parameters.

This video is "extra-curricular". You are more than welcome to skip it, toss it, never look at it again. Why? We're diving into the mathematics that allows us to perform backpropagation. There are plenty of algorithms out there designed to keep track of this under the hood, and we shall explore these shortly.

But if you're like me, maybe you just want to know! So pull out your pencil and paper and see if you can follow along yourself.

A forward pass ends in computing the loss.


Here's our neural network $f\colon \mathbb{R}^n \rightarrow \mathbb{R}$; that means it is a function from our input space of vectors $\mathbb{R}^n$ to a scalar value in $\mathbb{R}$.

Suppose we have a training set of $D$ input-output data points $(\mathbf{x}_d, y_d)$ for $d = 1, \ldots, D$.

Then after passing $\mathbf{x}_d$ through the network, our next step is to assess the quality of the answer; we want to compare $f(\mathbf{x}_d)$ against the ground truth, $y_d$.

In this example, we perform this with a least-squares (MSE) loss function. Such a loss function would be appropriate for regression problems. The function can be written as

$$L = \frac{1}{D} \sum_{d=1}^{D} \big(y_d - f(\mathbf{x}_d)\big)^2.$$
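As a concrete sketch of this formula (the function name `mse_loss` is my own; any array library would do, here NumPy):

```python
import numpy as np

def mse_loss(y, y_pred):
    """Least-squares (MSE) loss: L = (1/D) * sum_d (y_d - f(x_d))^2."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y - y_pred) ** 2)

# With D = 3 data points: squared errors [0.04, 0.04, 0.01], mean ~ 0.03
loss = mse_loss([1.0, 0.0, 1.0], [0.8, 0.2, 0.9])
```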

Let's talk about gradients


What is all this chatter about gradients? What is a gradient? Quite simply, it is the rate at which something changes at a particular point.

If the loss $L$ of our neural network responds to an input $\mathbf{x}$, then we can ask ourselves "what happens if we perturb $\mathbf{x}$ a little bit?", to $\mathbf{x} + \varepsilon$, say. How sensitive is $L$ to a little change by $\varepsilon$?

If $L$ is highly sensitive and a little shift in $\mathbf{x}$ shows it rapidly decreases, then that's the direction we want to follow! We can use gradients to work out where the loss becomes smaller.

Gradients of a single neuron


A single neuron takes the form

$$f(\mathbf{x}) = \sigma(\mathbf{w}\cdot\mathbf{x} + b)$$

where $\sigma$ is the logistic function we saw in previous videos: $\sigma(r) = (1 + e^{-r})^{-1}$. We are looking for the gradients with respect to the neuron parameters, given by

$$\frac{\partial L}{\partial \mathbf{w}} \quad \text{and} \quad \frac{\partial L}{\partial b}.$$

Now it's time for your pencil and paper.

Step 1: Apply the chain rule

The chain rule allows us to differentiate compositions of functions.

For example, $L$ is a function of $f$, and $f$ is a function of $\mathbf{w}$ and $b$.

For each component $w_i$ of $\mathbf{w} = [w_i]_i$, the formula gives us

$$\frac{\partial L}{\partial w_i} = \frac{1}{D}\sum_{d=1}^{D} \frac{\partial L(y_d, f(\mathbf{x}_d))}{\partial f(\mathbf{x}_d)} \cdot \frac{\partial f(\mathbf{x}_d)}{\partial w_i} = \frac{1}{D}\sum_{d=1}^{D} -2\big(y_d - f(\mathbf{x}_d)\big) \cdot \frac{\partial f(\mathbf{x}_d)}{\partial w_i}.$$

Step 2: Apply the chain rule again

Now we need to find the remaining partial derivative above:

$$\frac{\partial f(\mathbf{x}_d)}{\partial w_i} = \frac{\partial}{\partial w_i}\big[\sigma(\mathbf{w}\cdot\mathbf{x}_d + b)\big] = \sigma'(\mathbf{w}\cdot\mathbf{x}_d + b) \cdot x_{d,i}
$$

Step 3: Finally, differentiate the activation

The last unknown is the derivative of the logistic function itself. This is equal to

$$\sigma'(r) = e^{-r}\big(1 + e^{-r}\big)^{-2} = \sigma(r)\big(1 - \sigma(r)\big).$$
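A quick numerical sanity check of this identity, sketched with NumPy: the analytic derivative $\sigma(r)(1-\sigma(r))$ should agree with a central finite difference at any point.

```python
import numpy as np

def sigmoid(r):
    """Logistic function: sigma(r) = 1 / (1 + e^(-r))."""
    return 1.0 / (1.0 + np.exp(-r))

def sigmoid_prime(r):
    """Analytic derivative: sigma'(r) = sigma(r) * (1 - sigma(r))."""
    s = sigmoid(r)
    return s * (1.0 - s)

# Compare against a central finite difference at a few points.
eps = 1e-6
for r in [-2.0, 0.0, 1.5]:
    numeric = (sigmoid(r + eps) - sigmoid(r - eps)) / (2 * eps)
    assert abs(numeric - sigmoid_prime(r)) < 1e-8
```

Note the maximum slope sits at $r = 0$, where $\sigma'(0) = 0.5 \times 0.5 = 0.25$.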

And now we can wrap this into the code.
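Putting Steps 1–3 together, here is a minimal sketch of the gradient computation for one neuron (the function and variable names are my own, not from the lesson's notebook):

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def neuron_gradients(X, y, w, b):
    """Gradients of the MSE loss for f(x) = sigmoid(w.x + b).

    X: (D, n) inputs; y: (D,) targets; w: (n,) weights; b: scalar bias.
    Returns (dL/dw, dL/db).
    """
    D = X.shape[0]
    z = X @ w + b                 # pre-activation, shape (D,)
    f = sigmoid(z)                # predictions f(x_d)
    dL_df = -2.0 * (y - f) / D    # Step 1: dL/df for the MSE loss
    df_dz = f * (1.0 - f)         # Step 3: sigma'(z) = sigma(z)(1 - sigma(z))
    delta = dL_df * df_dz         # combined upstream gradient, shape (D,)
    grad_w = X.T @ delta          # Step 2: df/dw_i = sigma'(z) * x_i
    grad_b = delta.sum()          # df/db = sigma'(z) * 1
    return grad_w, grad_b
```

A finite-difference check (perturb one parameter, re-evaluate the loss) is a good way to convince yourself these expressions are right.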

Updating parameters: Stochastic Gradient Descent (SGD)


Let's finish where we started: why gradients? We have already noted that the gradient points in the direction of maximum change, and we can tell whether that change is positive or negative.

There is an established algorithm which tells us that a minimum of a function, like $L$, can be found by making updates

$$\mathbf{w} \leftarrow \mathbf{w} - \eta \cdot \frac{\partial L}{\partial \mathbf{w}}$$

where $\eta$ is a parameter called the learning rate. Tempering $\eta$ puts the brakes on and stops the model wildly diverging.
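The update rule can be sketched as a single self-contained step for our one-neuron model (names like `sgd_step` are my own; repeated application should drive the loss down):

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def sgd_step(X, y, w, b, eta=0.1):
    """One gradient-descent update: w <- w - eta * dL/dw (and likewise for b)."""
    D = X.shape[0]
    f = sigmoid(X @ w + b)
    # delta = dL/dz, combining the MSE and sigmoid derivatives from Steps 1-3.
    delta = -2.0 * (y - f) / D * f * (1.0 - f)
    w = w - eta * (X.T @ delta)
    b = b - eta * delta.sum()
    return w, b
```

Looping `w, b = sgd_step(X, y, w, b)` over many iterations is (full-batch) gradient descent; sampling a random subset of the data at each step turns it into *stochastic* gradient descent.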

Challenge for you...


We computed the gradients of the loss for a single neuron $f(\mathbf{x}) = \sigma(\mathbf{w}\cdot\mathbf{x} + b)$. Try writing down a larger neural network and computing the gradients for all parameters.