Neural Networks Learning The Basics: Backpropagation

In this post I will walk through a simple backpropagation algorithm that I have implemented on a spreadsheet. To keep things simple, I will use a single training example with two input features to explain the concept. For example:

A single training example

In the above we have a single observation with two input variables, x1 and x2, and one output variable y. In practice it is common to have many training examples.

There is ample source material available on backpropagation. However, not much of it works through the algorithm with actual values and interactive tools. This post is designed to help beginners get a grasp of the concept.

It is recommended that you have some knowledge of the forward pass process before reading. You can have a look at my previous content here.

Backpropagation Spreadsheet Example

The backpropagation algorithm that I will be going through has been implemented in Excel and can be downloaded here:

It is interactive and can be used by coders and non-coders who are interested in understanding the concept.

Backpropagation Python Implementation

The spreadsheet example is implemented and extended to include more training examples in the Python Jupyter notebook implementation, which can be found on my GitHub here.


In the post Neural Networks Learning the Basics, Layers foundation I described how inputs are transformed into outputs as they pass through the neural network. This is known as the forward pass process, or forward propagation. In that example, we initialized the weights with random values. This gave us an output (the estimate \hat{y}) that was far from optimal. However, we want good predictions, which means the estimated value needs to be closer to the real value. We refer to this as minimizing the loss or error (half the squared difference between the estimated and the actual value). The error value gives the neural network a measure of how well the model is performing, which allows it to adjust the weights accordingly. This method is called backpropagation.

Neural network example

The example we are going to look at is a neural network with one input layer consisting of two inputs, a hidden layer with two neurons and a bias term, and an output layer.

Forward Propagation

Forward propagation is covered in more detail here. It is the process by which a neural network converts inputs into outputs. I will perform forward propagation again in the context of the neural network above. Let us start with the math.

Hidden layer, calculating Z_1 and Z_2.

Z_1 = w_1*x_0 + w_2 *x_1
Z_2 = w_3*x_0 + w_4 *x_1

Applying the activation function sigmoid to the hidden layer to get a_1 and a_2.

a_1 = f(Z_1) = \frac{1}{1+e^{-Z_1}}
a_2 = f(Z_2) = \frac{1}{1+e^{-Z_2}}

Output layer calculating a_3. For simplicity we assume no activation on the output layer.

a_3 = w_5*b_1 + w_6*a_1 + w_7*a_2

Let us make the bias term b_1 a constant equal to one:

a_3 = w_5 + w_6*a_1 + w_7*a_2

a_3 = \hat{y}
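The forward pass above can be sketched in a few lines of Python. The input values and initial weights here are hypothetical placeholders for illustration; the spreadsheet uses its own inputs and randomly initialized weights.

```python
import math

# Hypothetical inputs and weights (the spreadsheet uses its own values)
x0, x1 = 0.5, 0.8                      # input features
w1, w2, w3, w4 = 0.1, 0.2, 0.3, 0.4    # hidden-layer weights
w5, w6, w7 = 0.5, 0.6, 0.7             # output-layer weights; w5 multiplies the bias b1 = 1

def sigmoid(z):
    # f(Z) = 1 / (1 + e^(-Z))
    return 1.0 / (1.0 + math.exp(-z))

# Hidden layer: weighted sums, then the sigmoid activation
Z1 = w1 * x0 + w2 * x1
Z2 = w3 * x0 + w4 * x1
a1 = sigmoid(Z1)
a2 = sigmoid(Z2)

# Output layer: no activation, bias b1 = 1
a3 = w5 * 1 + w6 * a1 + w7 * a2   # a3 is the estimate y_hat
```

With these placeholder values, `Z1 = 0.21` and `a3` comes out at roughly 1.26, which plays the role of \hat{y} in the rest of the post.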

If we substitute values and implement this in Excel, we get the output below.

Forward propagation excel example

That concludes forward propagation.

Gradient Descent

The main operation in backpropagation is the calculation of gradients.

Before looking at gradient descent, let us define the loss function. Mathematically this is expressed as:

J(\vec{w}) = \frac{1}{2} \sum_{d \in D} (y_d-\hat{y}_d)^2

where D is the set of training examples, y_d is the target output for training example d, and \hat{y}_d is the output calculated from the forward pass for training example d.
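The loss function J is straightforward to express in code. This is a minimal sketch; the function name and example values are my own, not from the spreadsheet.

```python
def loss(y_targets, y_preds):
    # J(w) = 1/2 * sum over training examples d of (y_d - y_hat_d)^2
    return 0.5 * sum((y - y_hat) ** 2 for y, y_hat in zip(y_targets, y_preds))
```

For a single training example with target 1.0 and prediction 0.5, the loss is 0.5 * (0.5)^2 = 0.125, and it is zero whenever the predictions match the targets exactly.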

We want to use a neural network to make good predictions. Therefore the goal of the neural network is to learn mappings (weight parameters) that bring \hat{y_d} close to y_d.

This means that the neural network needs to understand how a change in each component of the weight vector \vec{w} affects the error J, and then adjust \vec{w} accordingly. We therefore need to compute the gradient of the error J with respect to each component of \vec{w}. We write this as:

\nabla J(\vec{w}) \equiv \left[\frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \ldots, \frac{\partial J}{\partial w_n}\right]

To understand this better, let us assume a convex solution space below:

In the figure above, let us say the orange line represents the slope of J with respect to w_1. The goal is to reach the minimum point. In this example, if we adjust w_1 to the right, we move away from the minimum; if we adjust w_1 to the left, we move closer to it. Knowing the gradient of J with respect to w_1 tells the neural network how to adjust the weight towards the minimum.

Note that not all functions have a single global minimum and are as simple as a convex shape. You may have multiple local minima, like the figure below.

Complex loss function shape

Getting to the global minimum in this instance becomes more complex, and we do not want the neural network to miss it. This is where other concepts such as the learning rate and gradient descent variants come in. We will have a closer look at these methods in upcoming posts.


After we have computed the gradient of J with respect to w_i, that is, the rate at which the error changes with respect to the weight, we want to pass this knowledge back to the neural network.

The new weight value becomes:

w_i \leftarrow w_i - \alpha \frac{\partial J} {\partial w_i}

Here, \alpha is the learning rate, also known as the step size. This constant controls how large a step we take along the gradient.

Our main interest is in the gradient term.

\frac{\partial J} {\partial w_i}

For a one-training-example problem this can be written as:

\frac{\partial J} {\partial w_i} = - (y - \hat{y}) \frac{\partial \hat{y}} {\partial w_i}

The term -(y - \hat{y}) is known as the error term \delta; we will substitute it into all the equations going forward.

We therefore need the derivative of \hat{y} with respect to w_i, so the update becomes:

w_i \leftarrow w_i - \alpha\delta\frac{\partial \hat{y}} {\partial w_i}
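The update rule can be expressed as a small helper function. This is a minimal sketch with hypothetical values; the function name is my own.

```python
def update_weight(w, alpha, delta, dyhat_dw):
    # w <- w - alpha * delta * d(y_hat)/dw, where delta = -(y - y_hat)
    return w - alpha * delta * dyhat_dw

# Example: w = 0.5, learning rate 0.1, delta = 0.26, d(y_hat)/dw = 1
new_w = update_weight(0.5, 0.1, 0.26, 1.0)   # 0.5 - 0.1 * 0.26 = 0.474
```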

We start with the output layer, referencing the forward propagation equation:

a_3 = w_5 + w_6 a_1 + w_7 a_2

\frac{\partial \hat{y}} {\partial w_5} = 1

\frac{\partial \hat{y}} {\partial w_6} = a_1

\frac{\partial \hat{y}} {\partial w_7} = a_2

Therefore the updates become:

w_{5(new)} = w_{5(old)} - \alpha \delta \cdot 1

w_{6(new)} = w_{6(old)} - \alpha \delta a_1

w_{7(new)} = w_{7(old)} - \alpha \delta a_2
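The output-layer updates can be sketched directly in Python. The target, prediction, activations, weights, and learning rate below are hypothetical illustration values, not the spreadsheet's.

```python
# Hypothetical values for one training example
y, y_hat = 1.0, 1.26        # target and forward-pass estimate
a1, a2 = 0.55, 0.62         # hidden-layer activations
w5, w6, w7 = 0.5, 0.6, 0.7  # output-layer weights (w5 multiplies the bias)
alpha = 0.1                 # learning rate

delta = -(y - y_hat)        # error term, here 0.26

# w <- w - alpha * delta * d(y_hat)/dw
w5_new = w5 - alpha * delta * 1    # gradient w.r.t. w5 is 1 (bias input)
w6_new = w6 - alpha * delta * a1
w7_new = w7 - alpha * delta * a2
```

Since \hat{y} overshot the target (1.26 > 1.0), delta is positive and all three weights shrink slightly, pulling the next prediction down towards y.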

Implementing this on a spreadsheet with values we get:

Output backpropagation spreadsheet implementation

Hidden Layer Implementation

This is a bit trickier as we now need to take the activation function into account. Here, \frac{\partial \hat{y}} {\partial w_i} becomes:

\frac{\partial \hat{y}} {\partial  a} \frac{\partial a} {\partial Z}  \frac{\partial Z} {\partial  w}

If we want \frac {\partial \hat{y}} {\partial  w_1} we would write:

\frac{\partial \hat{y}} {\partial  a_1} \frac{\partial a_1} {\partial Z_1}  \frac{\partial Z_1} {\partial  w_1}

\frac{\partial \hat{y}} {\partial a_1} = w_6

\frac{\partial Z_1} {\partial  w_1} = x_0

The tricky one is \frac{\partial a_1} {\partial Z_1}. We need to first compute the derivative of the activation function.

\frac{\partial a} {\partial Z} = -(1+e^{-Z})^{-2} \cdot e^{-Z} \cdot (-1)

= \frac{e^{-Z}}{(1+e^{-Z})^2} = \frac{1}{1+e^{-Z}} \cdot \frac{e^{-Z}}{1+e^{-Z}}


Since a = \frac{1}{1+e^{-Z}} and 1-a = \frac{e^{-Z}}{1+e^{-Z}}, the derivative simplifies to:

\frac{\partial a} {\partial Z} = a(1-a)
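We can sanity-check this a(1-a) identity numerically by comparing it against a finite-difference estimate of the sigmoid's slope. This is a quick sketch; the test point z = 0.7 is arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    # da/dZ = a * (1 - a), derived above
    a = sigmoid(z)
    return a * (1 - a)

# Central-difference approximation of the slope at an arbitrary point
z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
```

The analytic and numeric values agree to many decimal places, and at z = 0 the derivative is exactly 0.5 * 0.5 = 0.25, the sigmoid's steepest slope.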

Putting this back into the weight update equation we have:

w_{1(new)} = w_{1(old)} - \alpha w_6  \delta  a_1(1-a_1) x_0

If we want \frac {\partial \hat{y}} {\partial  w_2} we would write:

\frac{\partial \hat{y}} {\partial  a_1} \frac{\partial a_1} {\partial Z_1}  \frac{\partial Z_1} {\partial  w_2}

w_{2(new)} = w_{2(old)} - \alpha w_6  \delta  a_1(1-a_1) x_1

If we want \frac {\partial \hat{y}} {\partial w_3} we would write:

\frac{\partial \hat{y}} {\partial  a_2} \frac{\partial a_2} {\partial Z_2}  \frac{\partial Z_2} {\partial  w_3}

w_{3(new)} = w_{3(old)} - \alpha w_7 \delta  a_2(1-a_2) x_0

And finally, if we want \frac {\partial \hat{y}} {\partial  w_4} we would write:

\frac{\partial \hat{y}} {\partial  a_2} \frac{\partial a_2} {\partial Z_2}  \frac{\partial Z_2} {\partial  w_4}

w_{4(new)} = w_{4(old)} - \alpha w_7 \delta a_2 (1-a_2) x_1
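Putting the forward pass and all seven weight updates together gives one complete training step. This is a minimal sketch of the equations in this post, with my own function name, hypothetical inputs, and an assumed learning rate; the spreadsheet and notebook use their own values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x0, x1, y, w, alpha=0.5):
    """One forward + backward pass for the 2-input, 2-hidden-neuron network.

    w = [w1, w2, w3, w4, w5, w6, w7]; the bias b1 is fixed at 1.
    Returns the updated weights and the loss before the update.
    """
    w1, w2, w3, w4, w5, w6, w7 = w
    # Forward pass
    Z1 = w1 * x0 + w2 * x1
    Z2 = w3 * x0 + w4 * x1
    a1, a2 = sigmoid(Z1), sigmoid(Z2)
    y_hat = w5 * 1 + w6 * a1 + w7 * a2
    # Backward pass: delta = -(y - y_hat)
    delta = -(y - y_hat)
    new_w = [
        w1 - alpha * delta * w6 * a1 * (1 - a1) * x0,  # hidden layer
        w2 - alpha * delta * w6 * a1 * (1 - a1) * x1,
        w3 - alpha * delta * w7 * a2 * (1 - a2) * x0,
        w4 - alpha * delta * w7 * a2 * (1 - a2) * x1,
        w5 - alpha * delta * 1,                        # output layer
        w6 - alpha * delta * a1,
        w7 - alpha * delta * a2,
    ]
    return new_w, 0.5 * (y - y_hat) ** 2
```

Calling `train_step` repeatedly on the same example should drive the loss towards zero, which is a useful check that the gradients are wired up correctly.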

Implementing these equations on the spreadsheet we get the below:

Hidden layer backpropagation implementation

Phew! That was a lot of math. It took me quite some time to get a grasp of it, and I hope this post has been useful. If you find any errors in the above, don't hesitate to contact me. I recommend trying to replicate the spreadsheet example to check your understanding. I will expand on the spreadsheet example as I work through more concepts.

Other Source Material

If you found my spreadsheet useful and want to understand more about backpropagation I recommend the below material:

Mitchell, Tom. (1997). Machine Learning, McGraw Hill. pp.97-123
Matt Mazur's post on backpropagation, explained here, provides a two-output example with a bias term applied to the hidden layer. See if you can apply it on a spreadsheet. Further, I enjoyed his backpropagation visualization, which can be found here.
Andrew Ng provides a pretty good explanation of gradient descent in his video, here.

There is no instinct like that of the heart

Lord Byron
