 Overview

In the previous post we looked at backpropagation. In the example we calculated the gradient of the loss with respect to each weight parameter. The gradient determined which direction to move the weight parameter and how much to move it by. The new weight parameter was then updated using the formula below: $w_{new} = w_{old} - \alpha \frac{\partial J} {\partial w_i}$

where $J$ is the error and $\alpha$ is the scaling constant known as the learning rate.

The update $\frac{\partial J} {\partial w_i}$ is called the gradient decent.

Each update made is called a step and the hope is that with each step the model gets closer and closer to the global minimum. The global minimum meaning weight parameters that best minimize the loss function there by, giving the best estimate of the target variable. In each step the weight parameters are altered in the direction of the steepest decent. How can we calculate the direction of the steepest decent? Here we need to add a gradient rule. The gradient rule which we applied in our previous post is the delta rule or standard gradient descent. Recall for a one training example problem: $\frac{\partial J} {\partial w_i} = - (y - \hat{y}) \frac{\partial \hat{y}} {\partial w_i}$

where $y$ is the target variable and $\hat{y}$ is the output obtained from forward propagation.

If we now have $d$ training examples and multidimensional weights $\frac{\partial J} {\partial w_i} = - (y_d - \hat{y_d}) \frac{\partial \hat{y_d}} {\partial w_i}$ $= \sum_{d \epsilon D} (y_d - \hat{y_d}) \frac{\partial}{\partial w_i} (y_d - \vec{w} \cdot \vec{x})$ $= \sum_{d \epsilon D} (y_d - \hat{y_d}) x_{id}$

where $x_{id}$ is the single input component $x_i$ from training set $d$

Therefore, the weight update rule for gradient descent becomes $\Delta w_i = \alpha \sum_{d \epsilon D} (y_d - \hat{y_d})x_{id}$

Note for simplicity we do not include an activation function in our example. This is therefore gradient descent implementation for linear units.

The spreadsheet implementation of gradient descent with $d$ training examples and multidimensional parameters can be downloaded here:

It is worthwhile going through this example and trying to replicate it. Only then will you truly understand how computationally expensive this process is and get an appreciation of other gradient descent modification methods. I have validated this output by replicating it in python. You can have a look at the python implementation here.

Why The Need For Modification Algorithms?

If we have a solution space like the figure below with one global minimum the gradient descent algorithm defined above will converge to a weight vector with minimum error, given that a sufficient learning rate $\alpha$ is used. If $\alpha$ is to large the algorithm may over shoot the minimum. Therefore some modification to the algorithm are made to gradually reduce $\alpha$ such as multiplying the gradient with $\frac{1}{number_of samples}$

What if they are multiple local minimum? Here, there is no guarantee that the method will find the global minimum. One common variation to gradient descent intended to overcome this is called stochastic gradient descent also known as incremental gradient decent.

Gradient descent computes the dot product of all inputs in the training set and their respective $y_d - \hat{y_d}$. This also means that $\hat{y_d}$ for all $D$ must also be determined before the weights are updated. These operations will become computationally expensive when done over a large data set. Stochastic gradient descent is a modification in that weights are updated incrementally. Here, the algorithm selects a random data point $d$ and computes the gradient of that data point and not the full training set. That is: $\Delta w_i = \alpha \sum_{d \epsilon D} (y_d - \hat{y_d})x_{id}$

The summation sign falls off and we have: $\Delta w_i = \alpha (y - \hat{y})x_{i}$

This may be a little bit difficult to grasp. I have therefore summarized in pseudo-code and have provided an interactive spreadsheet example that can be downloaded here:

initialize_weight_vector
select_learning_rate
Repeat the following until an approximate minimum is obtained
calculate_yhat
dotproduct(inputs, weight_vector)
dotproduct(inputs.T, y_hat-y)
new_weight
weight_vector = new_weight_vector

initialize_weight_vector
select_learning_rate
Repeat the following until an approximate minimum is obtained:
Randomly shuffle samples
Select example
compute yhat for single example
dotproduct(input(i), weight_vector)
dotproduct(input(i), y_hat(i)-y(i))
new_weight
weight_vector = new_weight_vector

The stochastic gradient descent therefore calculates an approximation of the gradient at each step, where as standard gradient descent calculates the gradient of the full data set at each step.

Summary

We can summarize the three key differences between gradient descent and stochastic gradient descent as:

In standard gradient descent the error is summed over all examples before a weight can be updated. In stochastic gradient descent, the data set is shuffled and updates are made upon examining a training example.

Standard gradient decent is more computationally expensive.

Stochastic gradient decent has a better chance of finding the global minimum when there is more than one minimum point in the solution space. Let us visualize this using the excel example, we plot error vs weight parameter for each step:

In the above figures, we can see that SGD is more flexible when exploring the solution space. SGD is not ideal for our linear simple example but will however be much more capable at solving complex problems.

They are many more modifications of gradient descent. There are also no solid principles to guide on choices. In most papers I have read, guidance is received simply by using what worked for others when trying to solve a similar problem.