In the previous post we looked at backpropagation. In the example we calculated the gradient of the loss with respect to each weight parameter. The gradient determined which direction to move the weight parameter and how much to move it by. The new weight parameter was then updated using the formula below:
w_i = w_i − η ∂E/∂w_i

where E is the error and η is the scaling constant known as the learning rate.
This update rule is called gradient descent.
Each update is called a step, and the hope is that with each step the model gets closer and closer to the global minimum. The global minimum is the set of weight parameters that best minimizes the loss function, thereby giving the best estimate of the target variable. In each step the weight parameters are altered in the direction of steepest descent. How can we calculate the direction of steepest descent? Here we need to add a gradient rule. The gradient rule we applied in our previous post is the delta rule, or standard gradient descent. Recall, for a problem with one training example:

E = ½ (y − ŷ)²  and  ∂E/∂w_i = (ŷ − y) x_i

where y is the target variable and ŷ is the output obtained from forward propagation.
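As a quick check of the delta rule, the single-example gradient and update can be computed by hand. A minimal sketch in plain Python; the values of x, y, w, and the learning rate are made up purely for illustration:

```python
# delta rule for one training example with a linear output unit
x = [2.0, 3.0]    # input features (illustrative)
w = [0.5, -0.1]   # current weights (illustrative)
y = 1.0           # target
eta = 0.05        # learning rate

# forward pass: y_hat is the dot product of weights and inputs
y_hat = sum(wi * xi for wi, xi in zip(w, x))          # 0.5*2 + (-0.1)*3 = 0.7

# gradient of the squared error: dE/dw_i = (y_hat - y) * x_i
grad = [(y_hat - y) * xi for xi in x]

# gradient descent step: w_i = w_i - eta * dE/dw_i
w = [wi - eta * gi for wi, gi in zip(w, grad)]
```

Working through the numbers, ŷ = 0.7, the error ŷ − y = −0.3, and each weight moves a little toward reducing the error.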
If we now have several training examples and multidimensional weights:

E = ½ Σ_d (y_d − ŷ_d)²  and  ∂E/∂w_i = Σ_d (ŷ_d − y_d) x_{i,d}

where x_{i,d} is the single input component i from training example d. Therefore, the weight update rule for gradient descent becomes:

w_i = w_i − η Σ_d (ŷ_d − y_d) x_{i,d}
Note that for simplicity we do not include an activation function in our example. This is therefore a gradient descent implementation for linear units.
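For linear units the batch update rule above is only a few lines of NumPy. This is a sketch under the same assumptions (no activation function, squared-error loss); the toy data, learning rate, and step count are made up for illustration:

```python
import numpy as np

def gradient_descent(X, y, eta=0.05, steps=2000):
    """Batch gradient descent for a linear unit: y_hat = X @ w."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        y_hat = X @ w             # forward pass over the full training set
        grad = X.T @ (y_hat - y)  # sum over d of (y_hat_d - y_d) * x_d
        w -= eta * grad           # update every weight at once
    return w

# toy data generated from known weights [2, -1], so we know the answer
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])
w = gradient_descent(X, y)
```

With noiseless linear data like this, the recovered weights converge to the generating weights [2, −1].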
The spreadsheet implementation of gradient descent with multiple training examples and multidimensional parameters can be downloaded here:
It is worthwhile going through this example and trying to replicate it. Only then will you truly understand how computationally expensive this process is and gain an appreciation for other gradient descent modification methods. I have validated this output by replicating it in Python. You can have a look at the Python implementation here.
Why The Need For Modification Algorithms?
If we have a solution space like the figure below, with one global minimum, the gradient descent algorithm defined above will converge to a weight vector with minimum error, provided a sufficiently small learning rate is used. If η is too large, the algorithm may overshoot the minimum. Therefore, some modifications to the algorithm gradually reduce η as training progresses.
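One simple way to shrink the step size over time is a multiplicative decay schedule. The starting rate and decay factor below are illustrative assumptions, not a prescription:

```python
# learning rate decay: multiply eta by a constant factor after each step
eta0, decay = 0.5, 0.9

etas = []
eta = eta0
for step in range(5):
    etas.append(eta)  # rate actually used at this step
    eta *= decay      # smaller steps as we approach the minimum
```

Early steps take large strides across the solution space; later steps are small enough not to overshoot the minimum.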
What if there are multiple local minima? Here, there is no guarantee that the method will find the global minimum. One common variation of gradient descent intended to overcome this is called stochastic gradient descent, also known as incremental gradient descent.
Stochastic Gradient Descent
Gradient descent computes the dot product of all inputs in the training set with their respective errors (ŷ_d − y_d). This also means that ŷ_d must be determined for all d before the weights are updated. These operations become computationally expensive when done over a large data set. Stochastic gradient descent is a modification in which the weights are updated incrementally. Here, the algorithm selects a random data point and computes the gradient for that data point alone, not the full training set. That is, starting from:

w_i = w_i − η Σ_d (ŷ_d − y_d) x_{i,d}

The summation sign falls off and we have:

w_i = w_i − η (ŷ_d − y_d) x_{i,d}
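A single stochastic update can be sketched in a few lines of Python. The data and learning rate below are illustrative; the point is that only one randomly chosen example is touched per update:

```python
import random

X = [[2.0, 1.0], [1.0, 3.0], [0.5, 2.0]]  # three training examples (illustrative)
y = [1.0, 0.0, 0.5]
w = [0.1, 0.1]
eta = 0.05

d = random.randrange(len(X))  # pick one training example at random

# forward pass for that single example only
y_hat = sum(wi * xi for wi, xi in zip(w, X[d]))

# update the weights from that one example's gradient
w = [wi - eta * (y_hat - y[d]) * xi for wi, xi in zip(w, X[d])]
```

No sum over the data set appears anywhere, which is exactly what makes each step cheap.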
This may be a little difficult to grasp. I have therefore summarized it in pseudo-code and provided an interactive spreadsheet example that can be downloaded here:
```
# gradient descent
initialize weight_vector
select learning_rate
repeat until an approximate minimum is obtained:
    y_hat = dotproduct(inputs, weight_vector)
    gradient_rule = dotproduct(inputs.T, y_hat - y)
    new_weight_vector = weight_vector - learning_rate * gradient_rule
    weight_vector = new_weight_vector
```
Stochastic gradient descent
```
# stochastic gradient descent
initialize weight_vector
select learning_rate
repeat until an approximate minimum is obtained:
    randomly shuffle the training examples
    for each example i:
        y_hat(i) = dotproduct(input(i), weight_vector)
        gradient_rule = dotproduct(input(i), y_hat(i) - y(i))
        new_weight_vector = weight_vector - learning_rate * gradient_rule
        weight_vector = new_weight_vector
```
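The SGD pseudo-code maps to Python/NumPy roughly as follows. The toy data, learning rate, and epoch count are assumptions chosen so the loop converges on this example:

```python
import numpy as np

def sgd(X, y, eta=0.05, epochs=200, seed=0):
    """Stochastic gradient descent for a linear unit."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):        # randomly shuffle samples
            y_hat_i = X[i] @ w                   # forward pass, one example
            w -= eta * (y_hat_i - y[i]) * X[i]   # update weights immediately
    return w

# same toy data as before, generated from known weights [2, -1]
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, -1.0])
w = sgd(X, y)
```

Note that the weights change inside the inner loop, after every single example, rather than once per pass over the data.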
Stochastic gradient descent therefore calculates an approximation of the gradient at each step, whereas standard gradient descent calculates the gradient of the full data set at each step.
We can summarize the three key differences between gradient descent and stochastic gradient descent as:
In standard gradient descent the error is summed over all examples before the weights are updated. In stochastic gradient descent, the data set is shuffled and the weights are updated upon examining each training example.
Standard gradient descent is more computationally expensive per weight update.
Stochastic gradient descent has a better chance of finding the global minimum when there is more than one minimum in the solution space. Let us visualize this using the Excel example by plotting error vs. weight parameter for each step:
In the figures above, we can see that SGD explores the solution space more flexibly. SGD is not ideal for our simple linear example, but it is much more capable at solving complex problems.
There are many more modifications of gradient descent, and there are no solid principles to guide the choice between them. In most papers I have read, guidance comes simply from using what worked for others on a similar problem.
For more on gradient descent, have a look through the recommended material below:
Jeremy Howard’s fastai explanation
Andrew Ng watch video here
Text book: Mitchell, Tom. (1997). Machine Learning, McGraw Hill. pp.93-95
10 Gradient Descent Optimization Algorithms and Cheat Sheet blog post here
The main purpose of this blog post was to motivate why modified gradient descent optimization algorithms are fundamental to neural networks, especially for solving complex problems. I hope this has been useful.
This concludes the theoretical part of Neural Networks Learning The Basics series. I will be introducing some practical examples in the next post.