by Dr Andy Corbett
4. Bonus Example on Backpropagation
- ✅ Take a short mathematical tour of the ins and outs of backpropagation.
- ✅ Have a go at the chain rule.
- ✅ Compute gradients.
- ✅ Update parameters.
This video is "extra-curricular". You are more than welcome to skip it, toss it, never look at it again. Why? We're diving into the mathematics that allows us to perform backpropagation. There are plenty of algorithms out there designed to keep track of this under the hood, and we shall explore these shortly.
But if you're like me, maybe you just want to know! So pull out your pencil and paper and see if you can follow along yourself.
A forward pass ends in computing the loss.
Here's our neural network $f\colon \mathbb{R}^{d} \to \mathbb{R}$; that means it is a function from our input space of vectors $\mathbf{x} \in \mathbb{R}^{d}$ to a scalar value $f(\mathbf{x})$.
Suppose we have a training set of input-output data points $(\mathbf{x}_i, y_i)$ for $i = 1, \dots, N$.
Then after passing $\mathbf{x}_i$ through the network our next step is to assess the quality of the answer $f(\mathbf{x}_i)$; we want to compare it against the ground truth, $y_i$.
In this example, we perform this with a least-squares (MSE) loss function. Such a loss function would be appropriate for regression problems. The function can be written as
$$L = \frac{1}{N}\sum_{i=1}^{N}\bigl(f(\mathbf{x}_i) - y_i\bigr)^2.$$
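If it helps to see this concretely, here is a minimal NumPy sketch of that loss (the function name and the sample values are purely illustrative):

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Least-squares (MSE) loss: the mean of the squared residuals."""
    return np.mean((y_pred - y_true) ** 2)

# Illustrative values: three predictions against three ground-truth targets.
y_pred = np.array([0.2, 0.7, 0.9])
y_true = np.array([0.0, 1.0, 1.0])
print(mse_loss(y_pred, y_true))  # approximately 0.0467
```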
Let's talk about gradients
What is all this chatter about gradients? What is a gradient? Quite simply, it is the rate at which something changes at a particular point.
If the loss $L$ of our neural network responds to an input $\mathbf{x}$, then we can ask ourselves "what happens if we perturb $\mathbf{x}$ a little bit?", by $\delta\mathbf{x}$, say. How sensitive is $L$ to a little change by $\delta\mathbf{x}$?
If $L$ is highly sensitive, and a little shift in $\mathbf{x}$ shows it rapidly decreases, then that's the direction we want to follow! We can use gradients to work out where the loss becomes smaller.
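To make the "perturb it a little" idea concrete, here is a small sketch that nudges one coordinate by a tiny step and compares the change in a stand-in loss to its analytic gradient (the quadratic loss here is just a toy, not our network's loss):

```python
import numpy as np

# A stand-in loss: L(x) = sum(x**2), whose gradient is 2*x.
def loss(x):
    return np.sum(x ** 2)

x = np.array([1.0, -2.0, 0.5])
eps = 1e-6  # the size of the perturbation delta-x

# Perturb the first coordinate and measure how the loss responds.
dx = np.zeros_like(x)
dx[0] = eps
numerical = (loss(x + dx) - loss(x)) / eps
analytic = 2 * x[0]
print(numerical, analytic)  # both close to 2.0
```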
Gradients of a single neuron
A single neuron takes the form
$$f(\mathbf{x}) = \sigma(\mathbf{w}\cdot\mathbf{x} + b),$$
where $\sigma$ is the logistic function we saw in previous videos: $\sigma(z) = \frac{1}{1 + e^{-z}}$. We are looking for the gradients with respect to the neuron parameters $\mathbf{w} = (w_1, \dots, w_d)$ and $b$, given by
$$\frac{\partial L}{\partial w_j} \quad\text{and}\quad \frac{\partial L}{\partial b}.$$
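As a quick reference, here is a minimal sketch of that neuron's forward pass in NumPy (the weight, bias and input values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """A single neuron: f(x) = sigma(w . x + b)."""
    return sigmoid(np.dot(w, x) + b)

# Illustrative parameters and input.
w = np.array([0.5, -0.3])
b = 0.1
x = np.array([1.0, 2.0])
print(neuron(x, w, b))  # 0.5, since w . x + b = 0 here
```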
Now it's time for your pencil and paper.
Step 1: Apply the chain rule
The chain rule allows us to differentiate compositions of functions.
For example, $L$ is a function of $f$, and $f$ is a function of $\mathbf{w}$ and $b$.
For each component $w_j$, the formula gives us
$$\frac{\partial L}{\partial w_j} = \frac{2}{N}\sum_{i=1}^{N}\bigl(f(\mathbf{x}_i) - y_i\bigr)\,\frac{\partial f}{\partial w_j}(\mathbf{x}_i),$$
and similarly for the bias $b$.
Step 2: Apply the chain rule again
Now we need to find the second derivative appearing above, $\partial f/\partial w_j$. Writing $z = \mathbf{w}\cdot\mathbf{x} + b$ and applying the chain rule once more gives
$$\frac{\partial f}{\partial w_j} = \sigma'(z)\,\frac{\partial z}{\partial w_j} = \sigma'(\mathbf{w}\cdot\mathbf{x} + b)\,x_j, \qquad \frac{\partial f}{\partial b} = \sigma'(\mathbf{w}\cdot\mathbf{x} + b).$$
Step 3: Finally, differentiate the activation
The last unknown is the derivative of the logistic function itself. This is equal to
$$\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr).$$
And now we can wrap this into the code.
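Here is a minimal NumPy sketch of the three steps above (the helper names and toy data are illustrative, not necessarily the course's own code); it computes $\partial L/\partial \mathbf{w}$ and $\partial L/\partial b$ for the single neuron under the MSE loss and checks one component against a finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, y, w, b):
    """MSE loss of the single neuron over the whole training set."""
    preds = sigmoid(X @ w + b)
    return np.mean((preds - y) ** 2)

def gradients(X, y, w, b):
    """Steps 1-3: chain rule for dL/dw and dL/db."""
    z = X @ w + b                      # pre-activations, shape (N,)
    f = sigmoid(z)                     # neuron outputs
    dL_df = 2.0 * (f - y) / len(y)     # step 1: derivative of the MSE term
    df_dz = f * (1.0 - f)              # step 3: sigma'(z) = sigma(z)(1 - sigma(z))
    dL_dz = dL_df * df_dz
    dL_dw = X.T @ dL_dz                # step 2: dz/dw_j = x_j
    dL_db = np.sum(dL_dz)              # step 2: dz/db = 1
    return dL_dw, dL_db

# Illustrative data: N = 4 samples, d = 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
y = np.array([0.0, 1.0, 1.0, 0.0])
w, b = np.array([0.5, -0.3]), 0.1

dL_dw, dL_db = gradients(X, y, w, b)

# Sanity check: compare dL/dw_0 against a finite difference.
eps = 1e-6
w_plus = w.copy(); w_plus[0] += eps
print(dL_dw[0], (loss(X, y, w_plus, b) - loss(X, y, w, b)) / eps)
```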
Updating parameters: Stochastic Gradient Descent (SGD)
Let's finish where we started: why gradients? We have already noted that gradients point in the direction of maximum change, and we can tell whether that change is positive or negative.
There is an established algorithm which tells us that the minimum of a function, like $L$, can be found by making updates
$$\theta \;\longleftarrow\; \theta - \varepsilon\,\frac{\partial L}{\partial \theta}$$
to each parameter $\theta$ (each weight $w_j$ and the bias $b$), where $\varepsilon$ is a parameter called the learning rate. Tempering $\varepsilon$ can put the brakes on and stop the model wildly diverging.
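Here is a compact, self-contained sketch of that update loop applied to the single neuron (the data, starting parameters, learning rate and step count are all arbitrary illustrative choices; genuine SGD would also sample a mini-batch at each step rather than using the whole toy set):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative data and starting parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
y = np.array([0.0, 1.0, 1.0, 0.0])
w, b = np.array([0.5, -0.3]), 0.1

learning_rate = 0.5   # the epsilon in the update rule; an arbitrary choice

for step in range(1000):
    f = sigmoid(X @ w + b)
    dL_dz = 2.0 * (f - y) / len(y) * f * (1.0 - f)   # steps 1-3 combined
    w = w - learning_rate * (X.T @ dL_dz)            # update each w_j
    b = b - learning_rate * np.sum(dL_dz)            # update the bias
    if step % 200 == 0:
        print(step, np.mean((f - y) ** 2))           # watch the loss as training proceeds
```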
Challenge for you...
We computed the gradients of the loss for a single neuron $f(\mathbf{x}) = \sigma(\mathbf{w}\cdot\mathbf{x} + b)$. Try writing down a larger neural network and computing the gradients for all parameters.