Neural Networks Learning The Basics: Gradient Descent and Stochastic Gradient Descent

Overview

In the previous post we looked at backpropagation. In that example we calculated the gradient of the loss with respect to each weight parameter; the gradient determined in which direction to move each weight parameter and by how much. The new weight parameter was then obtained using the formula below:

w_{new} = w_{old} - \alpha \frac{\partial J} {\partial w_i}

where J is the error and \alpha is the scaling constant known as the learning rate.

The term \frac{\partial J}{\partial w_i} is the gradient, and this iterative update procedure is called gradient descent.
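In code, this update is a single assignment. Here is a minimal sketch for one weight, with made-up illustrative values:

# one gradient descent step for a single weight (all values illustrative)
w_old = 0.5                   # current weight
grad = 0.2                    # dJ/dw evaluated at w_old
alpha = 0.1                   # learning rate
w_new = w_old - alpha * grad  # move against the gradient: 0.5 - 0.1 * 0.2 = 0.48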

Each update is called a step, and the hope is that with each step the model gets closer and closer to the global minimum: the set of weight parameters that best minimizes the loss function, thereby giving the best estimate of the target variable. In each step the weight parameters are moved in the direction of steepest descent. How do we calculate that direction? Here we need a gradient rule. The gradient rule we applied in the previous post is the delta rule, i.e. standard gradient descent. Recall that for a problem with a single training example:

\frac{\partial J} {\partial w_i} = - (y - \hat{y}) \frac{\partial \hat{y}} {\partial w_i}

where y is the target variable and \hat{y} is the output obtained from forward propagation.

If we now have a set D of training examples and multidimensional weights, the errors are summed over all examples:

\frac{\partial J}{\partial w_i} = - \sum_{d \in D} (y_d - \hat{y}_d) \frac{\partial \hat{y}_d}{\partial w_i}

= \sum_{d \in D} (y_d - \hat{y}_d) \frac{\partial}{\partial w_i} \left( y_d - \vec{w} \cdot \vec{x}_d \right)

= - \sum_{d \in D} (y_d - \hat{y}_d) x_{id}

where x_{id} is the single input component x_i from training example d.

Therefore, since the weights move against the gradient, \Delta w_i = -\alpha \frac{\partial J}{\partial w_i}, the weight update rule for gradient descent becomes

\Delta w_i = \alpha \sum_{d \in D} (y_d - \hat{y}_d) x_{id}

Note that for simplicity we do not include an activation function in our example. This is therefore the gradient descent implementation for linear units.
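In vector form the same update takes only a few lines of NumPy. Below is a minimal sketch with made-up toy data; the variable names are illustrative:

import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0]])       # toy inputs, one training example per row
y = np.array([5.0, 4.0, 11.0])   # toy targets (here y = x1 + 2*x2)
w = np.zeros(2)                  # initial weight vector
alpha = 0.01                     # learning rate

y_hat = X @ w                      # forward pass for all examples
w = w + alpha * X.T @ (y - y_hat)  # delta_w_i = alpha * sum over d of (y_d - yhat_d) * x_id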

Spreadsheet Implementation

The spreadsheet implementation of gradient descent with d training examples and multidimensional parameters can be downloaded here:

It is worthwhile going through this example and trying to replicate it. Only then will you truly understand how computationally expensive this process is and gain an appreciation for the gradient descent modification methods discussed below. I have validated the output by replicating it in Python. You can have a look at the Python implementation here.

Why The Need For Modification Algorithms?

If we have a solution space like the figure below, with one global minimum, the gradient descent algorithm defined above will converge to a weight vector with minimum error, provided a sufficiently small learning rate \alpha is used. If \alpha is too large the algorithm may overshoot the minimum. Therefore, modifications are made to the algorithm to reduce the effective step size, such as multiplying the gradient by \frac{1}{N}, where N is the number of samples.
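Continuing the NumPy sketch from above (same illustrative X, y, w and alpha), that scaling is a single extra division:

y_hat = X @ w
grad = X.T @ (y_hat - y) / len(y)  # mean gradient: divide by the number of samples
w = w - alpha * grad               # step size no longer grows with the data set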

What if there are multiple local minima? Then there is no guarantee that the method will find the global minimum. One common variation of gradient descent intended to overcome this is called stochastic gradient descent, also known as incremental gradient descent.

Gradient descent visualization of the solution space

Stochastic Gradient Descent

Gradient descent computes the dot product of all inputs in the training set with their respective y_d - \hat{y}_d. This means that \hat{y}_d must be determined for every example in D before the weights are updated. These operations become computationally expensive on a large data set. Stochastic gradient descent is a modification in which the weights are updated incrementally: the algorithm selects a random data point d and computes the gradient for that data point alone, not for the full training set. That is:

Instead of:

\Delta w_i = \alpha \sum_{d \in D} (y_d - \hat{y}_d) x_{id}

The summation sign falls off, and for the selected example d we have:

\Delta w_i = \alpha (y_d - \hat{y}_d) x_{id}

Spreadsheet Implementation

This may be a little difficult to grasp. I have therefore summarized both algorithms in code below and provided an interactive spreadsheet example that can be downloaded here:

Gradient descent

# gradient descent: each weight update uses the full training set
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_steps=1000):
    w = np.zeros(X.shape[1])      # initialize weight vector
    for _ in range(n_steps):      # repeat until an approximate minimum is reached
        y_hat = X @ w             # dot product of inputs and weight vector
        grad = X.T @ (y_hat - y)  # gradient rule over the whole data set
        w = w - alpha * grad      # new weight = old weight - learning rate * gradient
    return w

Stochastic gradient descent

# stochastic gradient descent: weights are updated one example at a time
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])               # initialize weight vector
    for _ in range(n_epochs):              # repeat until an approximate minimum is reached
        for i in rng.permutation(len(y)):  # randomly shuffle the samples
            y_hat_i = X[i] @ w             # compute yhat for the single example
            grad = (y_hat_i - y[i]) * X[i] # gradient of this example only
            w = w - alpha * grad           # incremental weight update
    return w
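As a quick sanity check, both routines can be exercised on a small synthetic linear problem; the data and parameter choices below are illustrative:

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))    # 100 synthetic examples with 2 features
true_w = np.array([3.0, -2.0])   # ground-truth weights
y = X @ true_w                   # noise-free linear targets

print(gradient_descent(X, y, alpha=0.005, n_steps=1000))
print(stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=50))
# both prints should be close to [ 3. -2.]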

Stochastic gradient descent therefore calculates an approximation of the gradient at each step, whereas standard gradient descent calculates the gradient over the full data set at each step.

Summary

We can summarize the three key differences between gradient descent and stochastic gradient descent as:

In standard gradient descent the error is summed over all examples before the weights are updated. In stochastic gradient descent, the data set is shuffled and the weights are updated after examining each training example.

Standard gradient descent is more computationally expensive per weight update.

Stochastic gradient descent has a better chance of finding the global minimum when there is more than one minimum point in the solution space. Let us visualize this using the Excel example by plotting error versus the weight parameter at each step (see the sketch below):
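A rough Python analogue of that plot, as a sketch on a one-dimensional toy problem (all data and parameters below are illustrative), records the weight and error at every step:

import numpy as np
import matplotlib.pyplot as plt

# one-dimensional toy problem so that error vs weight can be drawn directly
x = np.linspace(-1.0, 1.0, 50)   # single input feature
y = 3.0 * x                      # targets generated with true weight 3.0

w, alpha = -2.0, 0.05            # deliberately poor starting weight
ws, errors = [], []
for _ in range(30):
    y_hat = w * x
    ws.append(w)
    errors.append(0.5 * np.sum((y - y_hat) ** 2))  # squared-error loss J
    w = w + alpha * np.sum((y - y_hat) * x)        # gradient descent step

plt.plot(ws, errors, marker="o")  # error vs weight parameter at each step
plt.xlabel("weight w")
plt.ylabel("error J")
plt.show()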

In these figures we can see that SGD is more flexible when exploring the solution space. SGD is not ideal for our simple linear example, but it will be much more capable at solving complex problems.

There are many more modifications of gradient descent, and there are no solid principles guiding the choice between them. In most papers I have read, guidance comes simply from using what worked for others on a similar problem.

Further Reading

For more on gradient descent, have a look through the recommended material below:

Jeremy Howard’s fastai explanation
Andrew Ng’s lecture: watch the video here
Textbook: Mitchell, Tom (1997). Machine Learning. McGraw Hill, pp. 93–95
10 Gradient Descent Optimization Algorithms and Cheat Sheet blog post here

Conclusion

The main purpose of this blog post was to motivate why modified gradient descent optimization algorithms are fundamental to neural networks, specifically when solving complex problems. I hope this has been useful.

This concludes the theoretical part of Neural Networks Learning The Basics series. I will be introducing some practical examples in the next post.
