# Neural Networks Learning The Basics: Backpropagation

In this post I will walk through a simple backpropagation algorithm that I have implemented on a spreadsheet. It is kept simple by using a single training example with two input features to explain the concept. For example:

In the above we have a single observation with two input variables x1 and x2 and one output variable y. Having many training examples is what is common in practice.

There is ample source material available on backpropagation. However, not much of it implements the algorithm using actual values and interactive tools. This post is designed to help beginners get a grasp of the concept.

It is recommended that you have knowledge of the forward pass process before reading. You can have a look at my previous content, here.

The backpropagation algorithm that I will be going through has been implemented in Excel and can be downloaded here:

It is interactive and can be used by coders and non-coders who are interested in understanding the concept.

### Backpropagation Python Implementation

The spreadsheet example is implemented and extended to include more training examples in the Python Jupyter notebook implementation that can be found on my GitHub here.

### Overview

In the post Neural Networks Learning the Basics, Layers foundation I described how inputs are transformed into outputs as they pass through the neural network. This is known as the forward pass process or forward propagation. In that example, we initialized the weights with random values, which gave us an output (the estimate of y) that was far from optimal. We want good predictions, which means the estimated value needs to be closer to the real value. We refer to this as minimizing the loss or error (half the squared difference between the estimated and the actual value). The error value gives the neural network a measure of how well the model is performing, which allows it to adjust the weights accordingly. The method used to do this is called backpropagation.

The example we are going to look at is a neural network with one input layer consisting of two inputs, a hidden layer with two neurons and a bias term, and an output layer.

### Forward Propagation

Forward propagation, covered in more detail here, is the process by which a neural network converts inputs into outputs. I will perform forward propagation again in the context of the neural network above. Let us start with the math.

Hidden layer: calculating $Z_1$ and $Z_2$.

$Z_1 = w_1*x_0 + w_2 *x_1$
$Z_2 = w_3*x_0 + w_4 *x_1$

Applying the activation function sigmoid to the hidden layer to get $a_1$ and $a_2$.

$a_1 = f(Z_1) = \frac{1}{1+e^{-Z_1}}$
$a_2 = f(Z_2) = \frac{1}{1+e^{-Z_2}}$

Output layer: calculating $a_3$. For simplicity we assume no activation on the output layer.

$a_3 = w_5*b_1 + w_6*a_1 + w_7*a_2$

Let us make $b_1$, the bias term, a constant equal to one:

$a_3 = w_5 + w_6*a_1 + w_7*a_2$

$a_3 = \hat{y}$

If we substitute values and implement this in Excel we get the output below.

That concludes forward propagation.
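The forward pass above can be sketched in a few lines of Python. Note that the input and weight values here are illustrative placeholders, not the ones used in the spreadsheet:

```python
import math

# Hypothetical inputs and initial weights (the spreadsheet uses its own values)
x0, x1 = 0.5, 1.0                    # input features
w1, w2, w3, w4 = 0.1, 0.2, 0.3, 0.4  # hidden-layer weights
w5, w6, w7 = 0.5, 0.6, 0.7           # output-layer weights (w5 multiplies the bias b1 = 1)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hidden layer
z1 = w1 * x0 + w2 * x1
z2 = w3 * x0 + w4 * x1
a1 = sigmoid(z1)
a2 = sigmoid(z2)

# Output layer (no activation, bias b1 = 1)
a3 = w5 + w6 * a1 + w7 * a2          # a3 is the prediction y_hat
```

You can check each intermediate value against the corresponding spreadsheet cell as you go.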

The main operation in backpropagation is in the calculation of gradients.

Before looking at gradient descent let us define the loss function. Mathematically this is expressed as:

$J(\vec{w}) = \frac{1}{2} \sum_{d \in D} (y_d-\hat{y}_d)^2$

where $D$ is the set of training samples, $y_d$ is the target output for training example $d$ and $\hat{y_d}$ is the output calculated from the forward pass process for the training example $d$.
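As a quick sketch, this loss can be computed in Python as follows; the single target and prediction used below are made-up values:

```python
def loss(targets, predictions):
    # J(w) = 1/2 * sum over d in D of (y_d - y_hat_d)^2
    return 0.5 * sum((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions))

# Single-example case: y = 1.0, y_hat = 0.8, so J = 0.5 * 0.2^2 = 0.02
J = loss([1.0], [0.8])
```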

We want to use a neural network to make good predictions. Therefore the goal of the neural network is to learn mappings (weight parameters) that bring $\hat{y_d}$ close to $y_d$.

This means that the neural network needs to understand how a change in each component of the weight vector $\vec{w}$ affects the error $J$, and then adjust $\vec{w}$ accordingly. We therefore need to compute the gradient of the error $J$ with respect to each component of $\vec{w}$. We write this as:

$\nabla J(\vec{w}) \equiv \left[\frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_n}\right]$

To understand this better, let us assume a convex solution space below:

In the figure above, let us say the orange line represents the slope of $J$ with respect to $w_1$. The goal is to reach the minimum point. In this example, if we adjust $w_1$ to the right we move away from the minimum; if we adjust $w_1$ to the left we move closer to it. Knowing the gradient of $J$ with respect to $w_1$ lets the neural network know how to adjust the weight towards the minimum.

Note that not all loss functions will have a single global minimum and be as simple as a convex shape. You may have multiple local minima, like the figure below.

Getting to the global minimum in this instance becomes more complex, and we do not want the neural network to miss it. This is where other concepts such as the learning rate and gradient descent variants come in. We will have a closer look at these methods in an upcoming post.

### Backpropagation

After we have computed the gradient of $J$ with respect to $w_i$, that is, the rate at which the error is changing with respect to the weight, we want to pass this knowledge back to the neural network.

The new weight value becomes:

$w_i \leftarrow w_i - \alpha \frac{\partial J} {\partial w_i}$

Here, $\alpha$ is the learning rate, also known as the step size. This constant controls how large a step we take along the gradient.
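The update rule itself is one line of code. This sketch assumes the gradient for the weight has already been computed; the helper name and values are hypothetical:

```python
def update_weight(w, grad, alpha=0.1):
    # w <- w - alpha * dJ/dw
    return w - alpha * grad

# A positive gradient moves the weight down; a negative one moves it up.
w_new = update_weight(1.0, 0.5)   # 1.0 - 0.1 * 0.5 = 0.95
```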

Our main interest is in the gradient term.

$\frac{\partial J} {\partial w_i}$

For a problem with one training example this can be written as:

$\frac{\partial J} {\partial w_i} = - (y - \hat{y}) \frac{\partial \hat{y}} {\partial w_i}$

The term $-(y - \hat{y})$ is known as the error term, delta ($\delta$). We will substitute it into all the equations going forward.

As a result we need the derivative of $\hat{y}$ with respect to $w_i$:

$w_i \leftarrow w_i - \alpha\delta\frac{\partial \hat{y}} {\partial w_i}$

$a_3 = w_5 + w_6 a_1 + w_7 a_2$

$\frac{\partial \hat{y}} {\partial w_5} = 1$

$\frac{\partial \hat{y}} {\partial w_6} = a_1$

$\frac{\partial \hat{y}} {\partial w_7} = a_2$

Therefore the updates are:

$w_{5(new)} = w_{5(old)} - \alpha \delta \cdot 1$

$w_{6(new)} = w_{6(old)} - \alpha \delta a_1$

$w_{7(new)} = w_{7(old)} - \alpha \delta a_2$
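In Python, the three output-layer updates look like the sketch below. The target, prediction, and activations are illustrative placeholders, not the spreadsheet's values:

```python
alpha = 0.1                     # learning rate
y, y_hat = 1.0, 1.2812          # target and forward-pass prediction
a1, a2 = 0.5622, 0.6341         # hidden-layer activations

delta = -(y - y_hat)            # error term delta

w5, w6, w7 = 0.5, 0.6, 0.7
w5 = w5 - alpha * delta * 1     # bias weight (b1 = 1)
w6 = w6 - alpha * delta * a1
w7 = w7 - alpha * delta * a2
```

Because the prediction overshot the target, $\delta$ is positive and all three weights are nudged down.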

Implementing this on a spreadsheet with values we get:

### Hidden Layer Implementation

This is a bit trickier as we now need to take the activation function into account. Here, $\frac{\partial \hat{y}} {\partial w_i}$ becomes:

$\frac{\partial \hat{y}} {\partial a} \frac{\partial a} {\partial Z} \frac{\partial Z} {\partial w}$

If we want $\frac {\partial \hat{y}} {\partial w_1}$ we would write:

$\frac{\partial \hat{y}} {\partial a_1} \frac{\partial a_1} {\partial Z_1} \frac{\partial Z_1} {\partial w_1}$

$\frac{\partial \hat{y}} {\partial a_1} = w_6$

$\frac{\partial Z_1} {\partial w_1} = x_0$

The tricky one is $\frac{\partial a_1} {\partial Z_1}$. We need to first compute the derivative of the activation function.

$\frac{\partial a} {\partial Z} = -(1+e^{-Z})^{-2} \cdot e^{-Z} \cdot (-1)$

$= \frac{e^{-Z}}{(1+e^{-Z})^2} = \frac {1}{(1+e^{-Z})}* \frac{e^{-Z}}{(1+e^{-Z})}$

Because

$a = \frac {1}{(1+e^{-Z})}$

$\frac{\partial a} {\partial Z} = a*(1-a)$
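A quick way to sanity-check this result is to compare $a(1-a)$ against a numerical (central-difference) derivative of the sigmoid; the evaluation point below is arbitrary:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    a = sigmoid(z)
    return a * (1 - a)          # da/dZ = a * (1 - a)

# Central-difference approximation at an arbitrary point
z, h = 0.25, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
```

The two values agree to many decimal places, confirming the derivation.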

Putting this back into the weight update equation we have:

$w_{1(new)} = w_{1(old)} - \alpha w_6 \delta a_1(1-a_1) x_0$

If we want $\frac {\partial \hat{y}} {\partial w_2}$ we would write:

$\frac{\partial \hat{y}} {\partial a_1} \frac{\partial a_1} {\partial Z_1} \frac{\partial Z_1} {\partial w_2}$

$w_{2(new)} = w_{2(old)} - \alpha w_6 \delta a_1(1-a_1) x_1$

If we want $\frac {\partial \hat{y}} {\partial w_3}$ we would write:

$\frac{\partial \hat{y}} {\partial a_2} \frac{\partial a_2} {\partial Z_2} \frac{\partial Z_2} {\partial w_3}$

$w_{3(new)} = w_{3(old)} - \alpha w_7 \delta a_2(1-a_2) x_0$

And finally, if we want $\frac {\partial \hat{y}} {\partial w_4}$ we would write:

$\frac{\partial \hat{y}} {\partial a_2} \frac{\partial a_2} {\partial Z_2} \frac{\partial Z_2} {\partial w_4}$

$w_{4(new)} = w_{4(old)} - \alpha w_7 \delta a_2 (1-a_2) x_1$
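The four hidden-layer updates can be sketched the same way. Again, the inputs, weights, activations, and $\delta$ below are illustrative placeholders:

```python
alpha = 0.1
x0, x1 = 0.5, 1.0               # inputs
w1, w2, w3, w4 = 0.1, 0.2, 0.3, 0.4
w6, w7 = 0.6, 0.7               # output-layer weights feeding the error back
a1, a2 = 0.5622, 0.6341         # hidden activations from the forward pass
delta = 0.2812                  # -(y - y_hat) from the output layer

# Chain rule: d y_hat/d a  *  d a/d Z  *  d Z/d w
w1 = w1 - alpha * delta * w6 * a1 * (1 - a1) * x0
w2 = w2 - alpha * delta * w6 * a1 * (1 - a1) * x1
w3 = w3 - alpha * delta * w7 * a2 * (1 - a2) * x0
w4 = w4 - alpha * delta * w7 * a2 * (1 - a2) * x1
```

Note how the error reaching $w_1 \ldots w_4$ is scaled by the outgoing weight ($w_6$ or $w_7$) and by the sigmoid derivative $a(1-a)$.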

Implementing these equations on the spreadsheet we get the below:

Phew! That was a lot of math. It took me quite some time to get a grasp of it, and I hope this post has been useful. If you find any errors in the above, don't hesitate to contact me. I recommend trying to replicate the spreadsheet example to check your understanding. I will expand on the spreadsheet example as I work through more concepts.

### Other Source Material

If you found my spreadsheet useful and want to understand more about backpropagation I recommend the below material:

- Mitchell, Tom (1997). Machine Learning, McGraw-Hill, pp. 97–123.
- Matt Mazur's post on backpropagation, explained here, provides a two-output example with a bias term applied to the hidden layer. See if you can apply it on a spreadsheet. Further, I enjoyed his backpropagation visualization, which can be found here.
- Andrew Ng provides a pretty good explanation of gradient descent in his video, here.

> There is no instinct like that of the heart
>
> Lord Byron
