If you have any interest in machine learning and artificial intelligence (AI), you have almost certainly come across neural networks. I like to think of a neural network as a computer program that learns how to accomplish specific tasks with the help of lots and lots of data. Probably a very common definition. If you show a neural network data covering every possible scenario for a particular problem enough times, it will become close to perfect at solving that problem. Some neural networks have even solved problems better than humans. Sounds too good to be true? Many things about neural networks sound too good to be true.
Knowing this, you might have tried plugging a standard neural network into your problem and got not-so-great results. Gradient descent (the numerical optimization algorithm used to train neural networks) does not guarantee convergence. It also does not guarantee that a global minimum will be reached. It is therefore useful to understand how tuning hyperparameters can guide convergence and improve the performance of a neural network. OK, by the time I got to gradient descent I might have lost you. Check out more about gradient descent here. Understanding hyperparameter tuning will help you debug your model and understand its limitations.
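To make the point about local versus global minima concrete, here is a minimal sketch on a toy one-dimensional function rather than a neural network. The function `f`, the starting points and the learning rate are all my own illustrative choices. The same gradient descent loop ends up in a different minimum depending on where it starts.

```python
def f(x):
    # Toy non-convex function with two minima (chosen for illustration only)
    return x**4 - 3 * x**2 + x

def grad_f(x):
    # Derivative of f
    return 4 * x**3 - 6 * x + 1

def descend(x, learning_rate=0.01, steps=1000):
    """Plain gradient descent: repeatedly step against the gradient."""
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x

x_from_left = descend(-2.0)   # starts on the left-hand slope
x_from_right = descend(2.0)   # starts on the right-hand slope

print(f"start -2.0 -> x = {x_from_left:.4f}, f(x) = {f(x_from_left):.4f}")
print(f"start  2.0 -> x = {x_from_right:.4f}, f(x) = {f(x_from_right):.4f}")
```

Both runs converge, but only the run that starts on the left reaches the lower of the two minima. The run that starts on the right settles in a shallower local minimum and stays there.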
I created a neural network spreadsheet playground to improve my understanding of the learning process and of tuning hyperparameters. I hope you will find it useful too.
In this post I will walk through some examples that illustrate how the spreadsheet can be used to understand some fundamental concepts behind tuning hyperparameters without getting into the mathematics. If you are interested in the mathematics, however, check out my post on backpropagation.
The neural network can be downloaded here:
The configuration of the neural network is shown below. It has been adapted from here.
Example 1: Learning Rate
Let us go through an example in which we simulate different learning rates:
In this example we explore the relationship between the training loss and the learning rate, producing the plot below.
Here we see that increasing the learning rate beyond 0.5 increased the loss/error.
And then what?
Ideally one would want a learning rate that reduces the loss (taking over-fitting into account). When we explore lower learning rates we get the plot below. Here we see that a learning rate between 0.2 and 0.3 produces the minimum training loss, provided all other parameters remain constant.
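If you would like to try a similar learning rate sweep outside the spreadsheet, the sketch below is one way to do it in NumPy. It is not the spreadsheet's exact network: the 2-2-2 layout, the input and target values, the fixed biases, the random initial weights and the list of learning rates are all assumptions made for illustration, so the numbers it prints will not match the plots above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(learning_rate, steps=100, seed=0):
    """Train a toy 2-2-2 sigmoid network on one sample; return the final loss."""
    rng = np.random.default_rng(seed)
    x = np.array([0.05, 0.10])        # hypothetical input
    target = np.array([0.01, 0.99])   # hypothetical target
    w_hidden = rng.uniform(-0.5, 0.5, size=(2, 2))
    w_output = rng.uniform(-0.5, 0.5, size=(2, 2))
    b_hidden, b_output = 0.35, 0.60   # assumed biases, kept fixed during training

    for _ in range(steps):
        hidden = sigmoid(w_hidden @ x + b_hidden)
        output = sigmoid(w_output @ hidden + b_output)
        # Gradients of the loss 0.5 * sum((output - target)^2)
        delta_output = (output - target) * output * (1.0 - output)
        delta_hidden = (w_output.T @ delta_output) * hidden * (1.0 - hidden)
        w_output -= learning_rate * np.outer(delta_output, hidden)
        w_hidden -= learning_rate * np.outer(delta_hidden, x)

    hidden = sigmoid(w_hidden @ x + b_hidden)
    output = sigmoid(w_output @ hidden + b_output)
    return 0.5 * float(np.sum((output - target) ** 2))

learning_rates = [0.01, 0.1, 0.2, 0.3, 0.5, 1.0, 2.0]
final_losses = {lr: train(lr) for lr in learning_rates}
for lr, loss in final_losses.items():
    print(f"learning rate {lr:<5} final training loss {loss:.6f}")
```

Every run uses the same seed, so the learning rate is the only thing that changes between rows. That mirrors the "all other parameters remain constant" condition above.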
Example 2: The Bias
Let us have a look at how the biases impact the training loss.
Here I apply conditional formatting to the output. We can see that the output layer bias has more impact on the training loss. We know this because the colours change from left to right in favour of the output layer bias.
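The same kind of grid can be sketched in code. Again, this is an assumption-laden toy and not the spreadsheet itself: the 2-2-2 network, the data, the fixed scalar biases and the range of bias values are all hypothetical. Each cell holds the final training loss for one pair of biases, and the two "spread" numbers give a rough measure of how much the loss moves along each axis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def final_loss(b_hidden, b_output, learning_rate=0.5, steps=100, seed=0):
    """Train a toy 2-2-2 network with fixed biases; return the final loss."""
    rng = np.random.default_rng(seed)
    x = np.array([0.05, 0.10])        # hypothetical input
    target = np.array([0.01, 0.99])   # hypothetical target
    w_hidden = rng.uniform(-0.5, 0.5, size=(2, 2))
    w_output = rng.uniform(-0.5, 0.5, size=(2, 2))
    for _ in range(steps):
        hidden = sigmoid(w_hidden @ x + b_hidden)
        output = sigmoid(w_output @ hidden + b_output)
        delta_output = (output - target) * output * (1.0 - output)
        delta_hidden = (w_output.T @ delta_output) * hidden * (1.0 - hidden)
        w_output -= learning_rate * np.outer(delta_output, hidden)
        w_hidden -= learning_rate * np.outer(delta_hidden, x)
    hidden = sigmoid(w_hidden @ x + b_hidden)
    output = sigmoid(w_output @ hidden + b_output)
    return 0.5 * float(np.sum((output - target) ** 2))

bias_values = [-1.0, -0.5, 0.0, 0.5, 1.0]
# Rows: hidden layer bias. Columns: output layer bias.
grid = np.array([[final_loss(bh, bo) for bo in bias_values] for bh in bias_values])
print(np.round(grid, 4))

# Average range of the loss when only one of the two biases is varied
spread_over_output_bias = (grid.max(axis=1) - grid.min(axis=1)).mean()
spread_over_hidden_bias = (grid.max(axis=0) - grid.min(axis=0)).mean()
print(f"spread across output bias: {spread_over_output_bias:.4f}")
print(f"spread across hidden bias: {spread_over_hidden_bias:.4f}")
```

Whether this toy shows the same pattern as the spreadsheet depends on the configuration, so treat it as a way to run the experiment and not as a confirmation of the result.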
Example 3: Weights
What happens to the neural network if we use higher initial weights? Here we look at the impact relative to each epoch.
The outputs are as below:
We can see that in each of the scenarios above the neural network still arrives at a minimum training loss. What if we perform the same analysis using a lower learning rate? Let us set the learning rate to 0.01.
In the above we notice that there is no convergence. We will have to train the neural network for longer, that is, increase the number of steps, in order for the network to converge. This is because the learning rate determines how much the weights are adjusted along the gradient at each step. If the learning rate is too high we overshoot the minimum. If the learning rate is too low we may not reach convergence within the number of steps available.
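The effect of the step size is easiest to see on a problem much simpler than a neural network. The sketch below minimises f(w) = w² (my own toy choice, with its minimum at w = 0) using different learning rates and step counts.

```python
def gradient_descent(learning_rate, steps, start=1.0):
    """Minimise f(w) = w**2, whose gradient is 2*w, starting from `start`."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w

for lr, steps in [(0.01, 50), (0.01, 500), (0.3, 50), (1.1, 50)]:
    w = gradient_descent(lr, steps)
    print(f"learning rate {lr:<4} steps {steps:<3} final |w| = {abs(w):.6g}")
```

With a learning rate of 0.01, 50 steps leave w well short of the minimum, while 500 steps get very close to it. A learning rate of 0.3 gets there quickly, and 1.1 overshoots on every step so that w moves further away instead of closer.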
So what weights should one use?
The McGraw-Hill Machine Learning textbook recommends initialising the weights to values between -0.05 and 0.05.
Keras, a neural network framework, includes different weight initialization procedures. You can read more about them here.
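As a minimal sketch of the textbook range mentioned above, the snippet below draws initial weights uniformly between -0.05 and 0.05 with NumPy. The helper name `init_weights`, the layer sizes and the seed are hypothetical, and this is plain NumPy, not the Keras API.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the run is repeatable

def init_weights(n_inputs, n_outputs, limit=0.05):
    """Draw a weight matrix uniformly from [-limit, limit)."""
    return rng.uniform(-limit, limit, size=(n_inputs, n_outputs))

hidden_weights = init_weights(2, 2)   # hypothetical 2-input, 2-unit layer
output_weights = init_weights(2, 2)
print(hidden_weights)
print(output_weights)
```

Changing `limit` is a quick way to repeat the "higher initial weights" experiment from this example.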
I guess the next question is: how does one select which method to use?
Unfortunately I do not have a direct answer. The best approach is to research the problem you are trying to solve and see what methods have worked for others. This will give you an initial baseline. From then on you can carry out experiments that try to improve on the baseline. The same can be said for tuning the other hyperparameters.
Example 4: Scaling The Data
When you scale your data, you are basically applying boundaries to it. In my previous post I discussed activation functions, and briefly mentioned that caution needs to be taken when we apply an activation function to the output layer. Let us visualize this concept.
Looking at the training loss plot, the training process looks pretty normal. However, the numbers on the y-axis are inflated. Theoretically your loss should be heading towards zero.
The thing here is that you may misinterpret the performance of your neural net. Looking at the output values, you get very small numbers like those in the data sample below.
Imagine applying a test set to this data and getting an error of 82, when actually the error is 0.008. This will give you the impression that you have produced a very bad model when in reality you have not.
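Here is a small sketch of why the scale matters when you read an error value. The target values and the "network outputs" are made up for illustration, and min-max scaling is just one common choice. The same predictions give a tiny error on the scaled data and a much larger one once they are mapped back to the original units.

```python
import numpy as np

# Made-up targets in their original units
y = np.array([120.0, 250.0, 310.0, 480.0, 900.0])
y_min, y_max = y.min(), y.max()

def scale(values):
    """Min-max scale into the range [0, 1]."""
    return (values - y_min) / (y_max - y_min)

def unscale(values):
    """Map scaled values back to the original units."""
    return values * (y_max - y_min) + y_min

# Hypothetical network outputs: the scaled targets plus small errors
predictions_scaled = scale(y) + np.array([0.01, -0.02, 0.015, -0.005, 0.01])

mae_scaled = np.mean(np.abs(predictions_scaled - scale(y)))
mae_original = np.mean(np.abs(unscale(predictions_scaled) - y))

print(f"mean absolute error on scaled data:   {mae_scaled:.4f}")
print(f"mean absolute error in original units: {mae_original:.4f}")
```

Neither number is wrong. They describe the same model on different scales, so compare errors only when the predictions and the targets are on the same scale.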
Training neural networks is quite challenging. Gradient descent (the numerical optimization algorithm used to train neural networks) does not guarantee convergence.
"There is no formula to guarantee that (1) the network will converge to a good solution, (2) convergence is swift, or (3) convergence even occurs at all." (Neural Networks: Tricks of the Trade, 2012)
The training task is therefore to configure, test, and tune hyperparameters to address this challenge. I have demonstrated how one can use this spreadsheet to understand some theory and fundamental concepts. You can take it further to understand other concepts such as over-fitting, regularization, linearity and non-linearity.
Let me know if there is any information I might have missed, and which other concepts you would find interesting for me to develop next.