### Overview

In the previous post we looked at backpropagation. In that example we calculated the gradient of the loss with respect to each weight parameter. The gradient determined in which direction to move the weight parameter and by how much. The new weight parameter was then updated using the formula below:

$$w := w - \eta \frac{\partial E}{\partial w}$$

where $E$ is the error and $\eta$ is the scaling constant known as the learning rate.

This update rule is called gradient descent.

Each update made is called a step, and the hope is that with each step the model gets closer and closer to the global minimum. The global minimum means the weight parameters that best minimize the loss function, thereby giving the best estimate of the target variable. In each step the weight parameters are altered in the direction of steepest descent. How can we calculate the direction of steepest descent? Here we need to add a gradient rule. The gradient rule we applied in our previous post is the delta rule, or standard gradient descent. Recall, for a single-training-example problem:

$$E = \frac{1}{2}\left(y - \hat{y}\right)^2$$

where $y$ is the target variable and $\hat{y}$ is the output obtained from forward propagation.

If we now have $n$ training examples and multidimensional weights, the gradient of the error with respect to a weight component $w_j$ becomes

$$\frac{\partial E}{\partial w_j} = \sum_{i=1}^{n}\left(\hat{y}^{(i)} - y^{(i)}\right)x_j^{(i)}$$

where $x_j^{(i)}$ is the single input component $j$ from training example $i$.

Therefore, the weight update rule for gradient descent becomes

$$w_j := w_j - \eta \sum_{i=1}^{n}\left(\hat{y}^{(i)} - y^{(i)}\right)x_j^{(i)}$$

Note that for simplicity we do not include an activation function in our example. This is therefore a gradient descent implementation for linear units.
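The batch update rule above can be sketched in NumPy. This is a minimal illustration on a small synthetic linear data set, not the spreadsheet example itself; all names and parameter values here are assumptions:

```python
import numpy as np

# Sketch of batch gradient descent for linear units on synthetic data
# (the data set and all constants here are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))      # 100 examples, 2 input components
true_w = np.array([2.0, -3.0])
y = X @ true_w                     # targets generated by a known linear rule

w = np.zeros(2)                    # initialize the weight vector
eta = 0.005                        # learning rate
for step in range(500):
    y_hat = X @ w                  # forward pass over the full training set
    grad = X.T @ (y_hat - y)       # summed delta-rule gradient
    w = w - eta * grad             # step in the direction of steepest descent

print(w)  # approaches true_w
```

Note that every step touches all 100 examples, which is exactly the cost that motivates the modifications discussed below.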

### Spreadsheet Implementation

The spreadsheet implementation of gradient descent with multiple training examples and multidimensional parameters can be downloaded here:

It is worthwhile going through this example and trying to replicate it. Only then will you truly understand how computationally expensive this process is and gain an appreciation for the gradient descent modification methods. I have validated this output by replicating it in Python. You can have a look at the Python implementation here.

### Why The Need For Modification Algorithms?

If we have a solution space like the figure below, with one global minimum, the gradient descent algorithm defined above will converge to a weight vector with minimum error, given that a suitable learning rate is used. If $\eta$ is too large the algorithm may overshoot the minimum. Therefore some modifications to the algorithm gradually reduce $\eta$, for example by scaling it down as the number of steps grows.
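One way to shrink $\eta$ over time is a $1/t$-style decay schedule. This is a hedged sketch of one common choice; the schedule and constants are illustrative, not the ones used in the spreadsheet:

```python
# Illustrative learning-rate decay: eta shrinks as the step count grows.
eta0 = 0.1     # initial learning rate (illustrative value)
decay = 0.01   # decay constant (illustrative value)

def eta_at(step):
    # 1/t-style decay: a larger step count gives a smaller learning rate
    return eta0 / (1 + decay * step)

print(eta_at(0), eta_at(100))  # 0.1 0.05
```

Early steps take large strides toward the minimum, while later steps become small enough not to overshoot it.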

What if there are multiple local minima? Here there is no guarantee that the method will find the global minimum. One common variation of gradient descent intended to overcome this is called stochastic gradient descent, also known as incremental gradient descent.

### Stochastic Gradient Descent

Gradient descent computes the dot product of all inputs in the training set with their respective weights. This also means that $\hat{y}^{(i)}$ for all $i$ must be determined before the weights are updated. These operations become computationally expensive when done over a large data set. Stochastic gradient descent is a modification in which the weights are updated incrementally. Here, the algorithm selects a random data point and computes the gradient for that data point, not for the full training set. That is:

$$\frac{\partial E^{(i)}}{\partial w_j} = \left(\hat{y}^{(i)} - y^{(i)}\right)x_j^{(i)}$$

Instead of:

$$\frac{\partial E}{\partial w_j} = \sum_{i=1}^{n}\left(\hat{y}^{(i)} - y^{(i)}\right)x_j^{(i)}$$

The summation sign falls off and we have:

$$w_j := w_j - \eta\left(\hat{y}^{(i)} - y^{(i)}\right)x_j^{(i)}$$
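The single-example update can be sketched in NumPy as follows. Again, this is a minimal illustration on synthetic data, with all names and values assumed:

```python
import numpy as np

# Sketch of stochastic gradient descent for linear units: the weights
# are updated after every individual example (data here is synthetic).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
true_w = np.array([1.5, -0.5])
y = X @ true_w                              # noiseless linear targets

w = np.zeros(2)
eta = 0.01                                  # learning rate
for epoch in range(50):
    for i in rng.permutation(len(X)):       # randomly shuffle the samples
        y_hat_i = X[i] @ w                  # forward pass for one example
        w = w - eta * (y_hat_i - y[i]) * X[i]   # single-example update

print(w)  # approaches true_w
```

Each update here costs one dot product rather than a pass over the full training set, at the price of a noisier gradient estimate.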

### Spreadsheet Implementation

This may be a little difficult to grasp. I have therefore summarized both algorithms in pseudo-code and provided an interactive spreadsheet example that can be downloaded here:

**Gradient descent**

```
#gradient descent
initialize weight_vector
select learning_rate
repeat until an approximate minimum is obtained:
    y_hat = dotproduct(inputs, weight_vector)             # forward pass
    gradient = dotproduct(inputs.T, y_hat - y)            # gradient rule
    weight_vector = weight_vector - learning_rate * gradient   # weight update
```

**Stochastic gradient descent**

```
#stochastic gradient descent
initialize weight_vector
select learning_rate
repeat until an approximate minimum is obtained:
    randomly shuffle the samples
    for each example i:
        y_hat(i) = dotproduct(input(i), weight_vector)        # forward pass for one example
        gradient = dotproduct(input(i), y_hat(i) - y(i))      # single-example gradient
        weight_vector = weight_vector - learning_rate * gradient   # weight update
```

Stochastic gradient descent therefore calculates an approximation of the gradient at each step, whereas standard gradient descent calculates the gradient over the full data set at each step.

### Summary

We can summarize the three key differences between gradient descent and stochastic gradient descent as:

1. In standard gradient descent the error is summed over all examples before the weights are updated. In stochastic gradient descent, the data set is shuffled and the weights are updated after examining each training example.

2. Standard gradient descent is more computationally expensive per step.

3. Stochastic gradient descent has a better chance of finding the global minimum when there is more than one minimum point in the solution space.

Let us visualize this using the Excel example by plotting error vs. weight parameter for each step:

In the figures above, we can see that SGD is more flexible when exploring the solution space. SGD is not ideal for our simple linear example, but it will be much more capable at solving complex problems.

There are many more modifications of gradient descent, and there are no solid principles to guide the choice between them. In most papers I have read, guidance comes simply from using what worked for others on a similar problem.

### Further Reading

For more on gradient descent, have a look through the recommended material below:

- Jeremy Howard’s fastai explanation
- Andrew Ng: watch video here
- Textbook: Mitchell, Tom (1997). Machine Learning, McGraw Hill, pp. 93–95
- 10 Gradient Descent Optimization Algorithms and Cheat Sheet blog post here

### Conclusion

The main purpose of this blog post was to motivate why modifications of gradient descent optimization algorithms are fundamental to neural networks, more specifically for solving complex problems. I hope this has been useful.

This concludes the theoretical part of the Neural Networks Learning The Basics series. I will be introducing some practical examples in the next post.