In this post I will go through a simple backpropagation algorithm that I have implemented in a spreadsheet. It is simple because I use a single training example with two input features to explain the concept.
That is, we have a single observation with two input variables x1 and x2 and one output variable y. In practice it is common to have many training examples.
There is ample source material available on backpropagation. However, few sources implement it using actual values and interactive tools. This post is designed to help beginners get a grasp of the concept.
It is recommended that you understand the forward pass process before reading. You can have a look at my previous post, here.
Backpropagation Spreadsheet Example
The backpropagation algorithm that I will be going through has been implemented in Excel and can be downloaded here:
It is interactive and can be used by coders and non-coders who are interested in understanding the concept.
Backpropagation Python Implementation
The spreadsheet example is also implemented, and extended to include more training examples, in a Python Jupyter notebook that can be found on my GitHub here.
In the post Neural Networks Learning the Basics, Layers foundation, I described how inputs are transformed to outputs as they pass through the neural network. This is known as the forward pass process or forward propagation. In that example, we initialized the weights with random values. This gave us an output (the estimate of y) that was pretty much suboptimal. However, we want good predictions, which means the estimated value needs to be close to the real value. We achieve this by minimizing the loss or error (half the squared difference between the estimated and the actual value). The error value tells the neural network how well the model is performing, which allows it to adjust the weights accordingly. This method of adjusting the weights is called backpropagation.
The example we are going to look at is a neural network with one input layer consisting of two inputs, a hidden layer with two neurons and a bias term, and an output layer.
Forward propagation, the process by which a neural network converts inputs into outputs, is covered in more detail here. I will perform forward propagation again in the context of the neural network above. Let us start doing the math.
Hidden layer: calculate the weighted sum of the inputs for each of the two hidden neurons.
Apply the sigmoid activation function to each weighted sum to get the two hidden-layer outputs.
Output layer: calculate the estimated output from the hidden-layer outputs. For simplicity we assume no activation on the output layer.
Let us make the bias term a constant that is equal to one.
If we substitute values and implement this in Excel, we get the output below.
That concludes forward propagation.
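The forward pass described above can be sketched in a few lines of Python. This is a minimal sketch, not the spreadsheet itself: the weight names (w1 to w6, wb), the placement of the bias unit as a constant input to the output neuron, and the numeric values are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x1, x2, w, bias=1.0):
    """Forward pass for a network with 2 inputs, 2 hidden neurons, 1 output.

    w is a dict of hypothetical weight names (assumed, not the post's):
      w1..w4 connect the inputs to the hidden neurons,
      w5, w6 connect the hidden neurons to the output,
      wb connects the constant bias unit to the output (assumed placement).
    """
    z1 = w["w1"] * x1 + w["w2"] * x2   # weighted sum, hidden neuron 1
    z2 = w["w3"] * x1 + w["w4"] * x2   # weighted sum, hidden neuron 2
    h1, h2 = sigmoid(z1), sigmoid(z2)  # hidden-layer outputs
    y_hat = w["w5"] * h1 + w["w6"] * h2 + w["wb"] * bias  # no output activation
    return z1, z2, h1, h2, y_hat


# Illustrative values only; these are not the spreadsheet's numbers.
weights = {"w1": 0.1, "w2": 0.2, "w3": 0.3, "w4": 0.4,
           "w5": 0.5, "w6": 0.6, "wb": 0.7}
z1, z2, h1, h2, y_hat = forward(1.0, 2.0, weights)
print(h1, h2, y_hat)
```

Keeping the intermediate values (the weighted sums and hidden outputs) is deliberate, since the backward pass reuses them.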
Gradient Descent
The main operation in backpropagation is the calculation of gradients.
Before looking at gradient descent, let us define the loss function. It is half the squared difference between the target output and the output calculated from the forward pass process, summed over every training example in the set of training samples.
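As a rough illustration of the loss just described, the sketch below sums half the squared difference over a list of training examples. The function name and the sample numbers are hypothetical.

```python
def half_squared_error(targets, outputs):
    """Sum of 0.5 * (target - output)**2 over all training examples."""
    return sum(0.5 * (t - o) ** 2 for t, o in zip(targets, outputs))


# One training example, as in this post (illustrative numbers).
loss_single = half_squared_error([1.0], [0.4])

# Several training examples, as is common in practice.
loss_many = half_squared_error([1.0, 0.0, 2.0], [0.4, 0.1, 2.5])
print(loss_single, loss_many)
```

The factor of one half does not change where the minimum is; it only cancels the 2 that appears when the square is differentiated.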
We want to use a neural network to make good predictions. Therefore the goal of the neural network is to learn mappings (weight parameters) that bring the estimated output close to the target output.
This means that the neural network needs to understand how a change in each component of the weight vector affects the error, and then adjust accordingly. We therefore need to compute the gradient of the error with respect to each component of the weight vector, that is, the partial derivative of the error with respect to each weight.
To understand this better, let us assume the convex solution space below:
In the figure above, let us say the orange line represents the slope of the error with respect to the weight. The goal is to reach the minimum point. In this example, if we adjust the weight to the right we move away from the minimum. If we adjust it to the left we move closer to the minimum. Knowing the gradient of the error with respect to the weight lets the neural network know how to adjust the weight towards the minimum.
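To make the picture concrete, here is a small sketch using a hypothetical convex error curve, E(w) = (w - 3)^2, which is not the network's actual error surface. At a point to the right of the minimum the slope is positive, so moving left reduces the error and moving right increases it.

```python
def error(w):
    """Hypothetical convex error curve with its minimum at w = 3."""
    return (w - 3.0) ** 2


def slope(w):
    """Derivative of the hypothetical error curve."""
    return 2.0 * (w - 3.0)


w = 5.0     # current weight, to the right of the minimum
step = 0.1  # small adjustment used to probe each direction

slope_here = slope(w)
error_left = error(w - step)   # adjust the weight to the left
error_right = error(w + step)  # adjust the weight to the right
print(slope_here, error_left, error(w), error_right)
```

The sign of the slope is all that is needed to pick the direction: a positive slope means decrease the weight, a negative slope means increase it.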
Note that not all functions will have one global minimum point and be as simple as a convex shape. You may have multiple local minimum points, as in the figure below.
Getting to the global minimum in this instance becomes more complex and we do not want the neural network to miss it. This is where other concepts such as the learning rate and gradient descent methods may come in. We will have a closer look at these methods in an upcoming post.
After we have computed the gradient of the error with respect to a weight, that is, the rate at which the error changes with respect to that weight, we want to pass this knowledge back to the neural network.
The new weight value becomes the old weight value minus the learning rate multiplied by the gradient.
Here, the learning rate is a constant also known as the step size. It controls how large a step we take along the gradient at each update.
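The update rule can be sketched as below. The error curve E(w) = (w - 3)^2 is a hypothetical stand-in for the network's error, and the two learning rates are arbitrary values chosen only to show that a larger step size gets closer to the minimum in the same number of updates on this particular curve.

```python
def update_weight(weight, gradient, learning_rate):
    """One gradient descent step: move against the gradient."""
    return weight - learning_rate * gradient


def slope(w):
    """Slope of the hypothetical error curve E(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)


def run(learning_rate, steps=20, start=5.0):
    """Apply the update rule repeatedly from a fixed starting weight."""
    w = start
    for _ in range(steps):
        w = update_weight(w, slope(w), learning_rate)
    return w


w_small_step = run(learning_rate=0.01)
w_large_step = run(learning_rate=0.1)
print(abs(w_small_step - 3.0), abs(w_large_step - 3.0))
```

A larger learning rate is not always better: on this curve, any value above 1.0 would overshoot the minimum by more than the starting distance and diverge.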
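Our main interest is in the gradient term.
For a problem with one training example, the gradient is the difference between the estimated and the actual value multiplied by the derivative of the estimated output with respect to the weight.
This difference is known as the error term delta; we will substitute it in all the equations going forward.
As a result, we need the derivative of the estimated output with respect to each weight.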
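For reference, the single-example gradient can be written out as follows. The symbols are assumptions, since the post's own notation is not available here: y is the actual value, \hat{y} the estimated output, w any weight, and \Delta the error term. The sign convention chosen for \Delta is also an assumption.

```latex
\begin{align}
E &= \tfrac{1}{2}\,(y - \hat{y})^{2} \\
\frac{\partial E}{\partial w}
  &= -(y - \hat{y})\,\frac{\partial \hat{y}}{\partial w}
   = (\hat{y} - y)\,\frac{\partial \hat{y}}{\partial w}
   = \Delta\,\frac{\partial \hat{y}}{\partial w},
  \qquad \Delta = \hat{y} - y
\end{align}
```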
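We start with the output layer, referencing the forward propagation equation. Because the output layer has no activation, the derivative of the estimated output with respect to an output-layer weight is simply the value that the weight multiplies.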
Implementing this on a spreadsheet with values we get:
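A minimal sketch of the output-layer gradients, assuming a linear output of the form y_hat = w5 * h1 + w6 * h2 + wb * bias. The weight names, the bias placement, the sign convention of the error term, and the numbers are all illustrative assumptions rather than the spreadsheet's values.

```python
def output_layer_gradients(h1, h2, bias, y, y_hat):
    """Gradient of E = 0.5 * (y - y_hat)**2 for each output-layer weight."""
    delta = y_hat - y  # error term; sign convention is an assumption
    # With no output activation, d(y_hat)/d(weight) is the value it multiplies.
    return {"w5": delta * h1, "w6": delta * h2, "wb": delta * bias}


# Illustrative values only; not the spreadsheet's numbers.
h1, h2, bias = 0.6, 0.7, 1.0
w5, w6, wb = 0.5, 0.6, 0.7
y = 1.0
y_hat = w5 * h1 + w6 * h2 + wb * bias

grads = output_layer_gradients(h1, h2, bias, y, y_hat)
print(grads)
```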
Hidden layer implementation
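This is a bit more tricky as we now need to take into account the activation function. Here, the derivative of the estimated output with respect to a hidden-layer weight is obtained with the chain rule, passing through the hidden neuron that the weight feeds.
For the first hidden-layer weight, we multiply three derivatives: the estimated output with respect to the hidden neuron's output, the hidden neuron's output with respect to its weighted sum, and the weighted sum with respect to the weight.
The tricky one is the derivative of the hidden neuron's output with respect to its weighted sum. We first need to compute the derivative of the activation function, which for the sigmoid is the sigmoid's output multiplied by one minus that output.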
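The derivative of the sigmoid has a convenient closed form: the sigmoid's output times one minus that output. The sketch below computes it and compares it against a finite-difference estimate as a sanity check; the function names and the test point are arbitrary choices.

```python
import math


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Closed-form derivative: sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


z = 0.5      # arbitrary test point
eps = 1e-6   # small perturbation for the finite-difference estimate
analytic = sigmoid_derivative(z)
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
print(analytic, numeric)
```

Because the derivative is expressed through the sigmoid's own output, the hidden-layer outputs saved during the forward pass can be reused directly.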
Putting this back into the weight equation gives the gradient for that weight.
The gradients for the other three hidden-layer weights follow the same pattern, each using the input and the hidden neuron that the weight connects.
Implementing these equations on the spreadsheet we get the below:
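Putting the pieces together, the sketch below computes the gradient for every weight of a small network with 2 inputs, 2 hidden neurons and 1 output, checks each one against a finite-difference estimate, and applies one update step. The weight names (w1 to w6, wb), the bias feeding the output neuron, the learning rate, and all numeric values are assumptions for illustration and do not reproduce the spreadsheet.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x1, x2, w, bias=1.0):
    """Forward pass; returns the hidden outputs and the estimated output."""
    h1 = sigmoid(w["w1"] * x1 + w["w2"] * x2)
    h2 = sigmoid(w["w3"] * x1 + w["w4"] * x2)
    y_hat = w["w5"] * h1 + w["w6"] * h2 + w["wb"] * bias  # linear output
    return h1, h2, y_hat


def loss(x1, x2, y, w):
    """Half the squared difference for a single training example."""
    _, _, y_hat = forward(x1, x2, w)
    return 0.5 * (y - y_hat) ** 2


def gradients(x1, x2, y, w, bias=1.0):
    """Backward pass: chain rule applied to every weight."""
    h1, h2, y_hat = forward(x1, x2, w, bias)
    delta = y_hat - y  # error term (assumed sign convention)
    return {
        # hidden layer: delta * d(y_hat)/d(h) * d(h)/d(z) * d(z)/d(w)
        "w1": delta * w["w5"] * h1 * (1.0 - h1) * x1,
        "w2": delta * w["w5"] * h1 * (1.0 - h1) * x2,
        "w3": delta * w["w6"] * h2 * (1.0 - h2) * x1,
        "w4": delta * w["w6"] * h2 * (1.0 - h2) * x2,
        # output layer: delta * the value each weight multiplies
        "w5": delta * h1,
        "w6": delta * h2,
        "wb": delta * bias,
    }


def numerical_gradient(x1, x2, y, w, name, eps=1e-6):
    """Central finite-difference estimate, used only as a check."""
    up, down = dict(w), dict(w)
    up[name] += eps
    down[name] -= eps
    return (loss(x1, x2, y, up) - loss(x1, x2, y, down)) / (2.0 * eps)


# Illustrative values only; these are not the spreadsheet's numbers.
weights = {"w1": 0.1, "w2": 0.2, "w3": 0.3, "w4": 0.4,
           "w5": 0.5, "w6": 0.6, "wb": 0.7}
x1, x2, y = 1.0, 2.0, 1.0

grads = gradients(x1, x2, y, weights)
max_gap = max(abs(grads[k] - numerical_gradient(x1, x2, y, weights, k))
              for k in weights)

learning_rate = 0.1  # arbitrary step size
new_weights = {k: weights[k] - learning_rate * grads[k] for k in weights}
loss_before = loss(x1, x2, y, weights)
loss_after = loss(x1, x2, y, new_weights)
print(max_gap, loss_before, loss_after)
```

The finite-difference check is a useful habit when replicating the spreadsheet by hand: if an analytic gradient disagrees with the numerical estimate, one of the chain-rule terms is likely wrong.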
Phew! That was a lot of math. It took me quite some time to get a grasp of it and I hope this post has been useful. If you find any errors in the above, don’t hesitate to contact me. I recommend trying to replicate the spreadsheet example to check your understanding. I will expand on the spreadsheet example as I work through more concepts.
Other Source Material
If you found my spreadsheet useful and want to understand more about backpropagation, I recommend the material below:
Mitchell, Tom. (1997). Machine Learning, McGraw Hill. pp. 97-123
Matt Mazur's post explaining backpropagation, here, provides a two-output example with a bias term applied to the hidden layer. See if you can apply it on a spreadsheet. Further, I enjoyed his backpropagation visualization, which can be found here.
Andrew Ng provides a pretty good explanation of gradient descent in his video, here.