What are hyperparameters in deep learning?
In a deep learning neural network, hyperparameters are the variables that determine the network structure as well as how the network is trained. These hyperparameters are set before training begins, that is, before the weights and biases are optimized.
It is not possible to know in advance the best value of a model hyperparameter for a particular problem. Initial values are usually chosen by rules of thumb or borrowed from similar problems.
You may also search for the best value through trial and error. When you tune a machine learning algorithm for a particular problem, you are tuning the model hyperparameters in order to discover the model parameters that yield the most accurate predictions possible.
What are examples of hyperparameters?
Here are some examples of hyperparameters split into two categories:
Hyperparameters related to Network structure
Hidden layers are the layers between the input layer and the output layer. A common approach is to keep adding layers until the test error no longer improves.
A large number of hidden units within a layer with regularization techniques can improve the accuracy of the network.
Dropout is a regularization technique that is employed for the purpose of preventing overfitting (increasing the validation accuracy). Dropout is effective in increasing the generalization power.
A small dropout rate is generally used: 20% of neurons is a common starting point, and the rate may be raised to as much as 50%.
Dropout tends to give better results on a larger network, because the model has a greater opportunity to learn independent representations.
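To make the dropout rate concrete, here is a minimal sketch in plain Python. It assumes the common "inverted dropout" formulation, in which the surviving activations are rescaled so their expected value is unchanged. The function name `dropout` and the toy activations are illustrative only and do not come from any particular library.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` and
    rescale the survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        # At inference time every unit is kept unchanged.
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10000                      # toy layer output (assumption)
dropped = dropout(hidden, rate=0.2, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)     # close to the rate
mean_activation = sum(dropped) / len(dropped)         # close to the original mean
```

With a rate of 0.2, roughly one unit in five is zeroed on each training pass, while the rescaling keeps the average activation close to its original value.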
Network Weight Initialization
It could be useful to make use of different weight initialization schemes according to the activation function that is employed on each layer. In most cases, uniform distribution is used.
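As an illustration of matching the initialization scheme to the activation function, the sketch below draws weights from a uniform distribution with two widely used limits: Glorot (Xavier) uniform and He uniform. Pairing He with rectifier layers and Glorot with the others is a common convention that is assumed here and is not prescribed by the text above. The helper names are hypothetical.

```python
import math
import random

def glorot_uniform_limit(fan_in, fan_out):
    """Limit of the Glorot (Xavier) uniform distribution."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def he_uniform_limit(fan_in):
    """Limit of the He uniform distribution."""
    return math.sqrt(6.0 / fan_in)

def init_weights(fan_in, fan_out, activation, rng):
    """Return a fan_in x fan_out weight matrix drawn from U(-limit, limit).
    Assumption: He for "relu" layers, Glorot for everything else."""
    if activation == "relu":
        limit = he_uniform_limit(fan_in)
    else:
        limit = glorot_uniform_limit(fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(0)
hidden_weights = init_weights(64, 32, "relu", rng)      # rectifier hidden layer
output_weights = init_weights(32, 1, "sigmoid", rng)    # sigmoid output layer
```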
Activation functions are used for the purpose of introducing non-linearity to models. This empowers deep learning models to learn nonlinear prediction boundaries.
The rectifier (ReLU) activation function is the most widely used. The sigmoid activation function is used in the output layer for binary predictions, and the softmax activation function is used in the output layer for multi-class predictions.
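The three activation functions named above can be written in a few lines of plain Python. This is a minimal sketch for scalar inputs and small lists. The numerically stable forms of sigmoid and softmax are an implementation choice made here, not something the text requires.

```python
import math

def relu(x):
    """Rectifier: passes positive values through, clips negatives to zero."""
    return max(0.0, x)

def sigmoid(x):
    """Squashes a real value into (0, 1); suited to a binary output unit.
    Split by sign so that math.exp never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softmax(logits):
    """Turns a list of scores into probabilities that sum to 1; suited to a
    multi-class output layer. Subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

class_probabilities = softmax([1.0, 2.0, 3.0])
```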
Hyperparameters related to Training Algorithm
The learning rate defines how quickly a network updates its parameters. A low learning rate slows the learning process down, but it converges smoothly. A higher learning rate speeds up learning, but training may fail to converge.
A decaying learning rate is generally preferred.
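The effect of the learning rate can be seen on a toy problem. The sketch below minimizes f(w) = w² by gradient descent with a constant low rate, a constant rate that is too high, and an exponentially decaying rate. The objective, the specific rates, and the decay factor are all assumptions chosen for illustration.

```python
def gradient_descent(lr_schedule, steps=50, w0=5.0):
    """Minimize f(w) = w**2 (gradient 2*w), using the learning rate
    returned by lr_schedule(t) at step t."""
    w = w0
    for t in range(steps):
        w -= lr_schedule(t) * 2.0 * w
    return w

low = gradient_descent(lambda t: 0.01)               # slow but steady
high = gradient_descent(lambda t: 1.1)               # overshoots and diverges
decayed = gradient_descent(lambda t: 0.9 * 0.9 ** t) # large steps first, then small
```

On this toy problem the low rate is still far from the minimum after 50 steps, the high rate moves further away from it on every step, and the decaying schedule ends closest to zero.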
Momentum uses knowledge of the preceding steps to know the direction of the next step. It helps in avoiding oscillations. A momentum between 0.5 and 0.9 is generally used.
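A minimal sketch of gradient descent with classical (heavy-ball) momentum follows. The quadratic loss, with one steep and one shallow direction, is a toy assumption, and the function names are illustrative rather than taken from a library.

```python
def sgd_momentum(grad, w, lr, momentum, steps):
    """Gradient descent with classical momentum:
    velocity = momentum * velocity - lr * gradient; w = w + velocity."""
    velocity = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        velocity = [momentum * v - lr * gi for v, gi in zip(velocity, g)]
        w = [wi + v for wi, v in zip(w, velocity)]
    return w

def loss(w):
    # Toy loss: shallow along w[0], steep along w[1].
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def grad(w):
    return [w[0], 25.0 * w[1]]

start = [1.0, 1.0]
plain = sgd_momentum(grad, start, lr=0.03, momentum=0.0, steps=200)
with_momentum = sgd_momentum(grad, start, lr=0.03, momentum=0.9, steps=200)
```

Setting `momentum=0.0` recovers plain gradient descent, which makes it easy to compare the final loss of the two runs under the same learning rate and step budget.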
Number of epochs
The number of epochs is the number of times during training that the whole training data is shown to the network.
The number of epochs should be increased until the validation accuracy begins to decrease, even if the training accuracy is still increasing (overfitting).
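This stopping rule is usually automated as early stopping. The sketch below picks the stopping point from a list of per-epoch validation accuracies. The `patience` parameter and the accuracy values are hypothetical and serve only to illustrate the idea.

```python
def early_stopping_epoch(val_accuracies, patience=2):
    """Return (best_epoch, best_accuracy). Training is considered stopped
    once validation accuracy has not improved for `patience` epochs."""
    best_epoch, best_acc = 0, float("-inf")
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            best_epoch, best_acc = epoch, acc
        elif epoch - best_epoch >= patience:
            break  # validation accuracy has stalled: stop training
    return best_epoch, best_acc

# Made-up validation accuracies, one per epoch (illustrative only).
history = [0.71, 0.78, 0.83, 0.85, 0.84, 0.83, 0.82]
best_epoch, best_acc = early_stopping_epoch(history)
```

In practice the weights from the best epoch are kept, rather than those from the last epoch that was run.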
Mini-batch size is the number of training samples fed to the network before each parameter update.
32 is a good default batch size, although you could also consider 64, 128, and other batch sizes.
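To show how the batch size determines the number of parameter updates per epoch, here is a small sketch that shuffles a dataset and yields it in mini-batches. The dataset of 100 integers is a stand-in for real training samples, and the helper name is hypothetical.

```python
import random

def iterate_minibatches(samples, batch_size, rng):
    """Yield the samples in shuffled mini-batches of at most batch_size items.
    One parameter update would follow each yielded batch."""
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

data = list(range(100))   # stand-in for 100 training samples
batches = list(iterate_minibatches(data, batch_size=32, rng=random.Random(0)))
updates_per_epoch = len(batches)
batch_lengths = [len(b) for b in batches]
```

With 100 samples and a batch size of 32, one epoch consists of four updates, the last of which uses a smaller batch of 4 samples.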
What is the difference between a parameter and a hyperparameter?
A parameter (also known as a model parameter) is a configuration variable that is internal to the model. It is possible to estimate its value from the data. Parameters are required by the model in order for it to make predictions and usually are saved as part of the model. A hyperparameter (also known as a model hyperparameter) is external to the model and it is not possible to estimate its value from the data.
Parameters are usually learned from data and are not usually set manually by a practitioner. Hyperparameters, on the other hand, are often set manually by a practitioner. It is possible to set hyperparameters by using heuristics.
While the parameter values determine the skill of the model on a problem, hyperparameters are used in the processes that help estimate those parameters.
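The distinction can be illustrated with a very small model. In the sketch below, which fits a straight line by gradient descent, `learning_rate` and `epochs` are hyperparameters chosen by the practitioner, while `w` and `b` are parameters estimated from the data. The data and the chosen values are assumptions made for the example.

```python
def fit_line(xs, ys, learning_rate, epochs):
    """Fit y = w * x + b by gradient descent on the mean squared error.
    learning_rate and epochs are hyperparameters; w and b are parameters."""
    w, b = 0.0, 0.0                      # parameters, learned from the data
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]         # toy data generated from w=2, b=1

w, b = fit_line(xs, ys, learning_rate=0.05, epochs=2000)
```

Only `w` and `b` need to be saved in order to make predictions later. The learning rate and epoch count influenced how they were found but are not part of the fitted model.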
Why is hyperparameter tuning important?
Hyperparameter tuning is the problem of choosing the optimal set of hyperparameters to be used in a learning algorithm. As discussed earlier, you can’t know the optimal value of a hyperparameter in advance. But since hyperparameters essentially control the overall behaviour of a machine learning model, it is critical to find good values for them.
Hyperparameter tuning is of extreme importance. The goal is to find the right combination of hyperparameters to minimize a predefined loss function and therefore get better results from the model.
If hyperparameter tuning is not done correctly, the model may fail to converge or to minimize the loss function effectively, and it would then produce sub-optimal results.
What are the methods to find hyperparameters?
The methods used to find hyperparameters are:
- Manual search
- Grid search
- Random search
- Bayesian Optimization
How to do hyperparameter tuning?
Grid search is the most basic hyperparameter tuning method. It involves building a model for every combination of the hyperparameter values provided. After that, every model is evaluated and the combination that produces the best results is selected.
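A minimal sketch of grid search is shown below. The `validation_loss` function is a hypothetical stand-in for "train a model with these hyperparameters and evaluate it". In a real project it would be replaced by actual training and validation code, and the grid values are likewise only an example.

```python
import itertools

def validation_loss(learning_rate, dropout):
    # Toy stand-in for training and evaluating a model (assumption).
    return (learning_rate - 0.01) ** 2 * 1e4 + (dropout - 0.3) ** 2

def grid_search(objective, grid):
    """Evaluate every combination of the values in `grid` and
    return the combination with the lowest score."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "dropout": [0.2, 0.3, 0.5],
}
best_params, best_score = grid_search(validation_loss, grid)
```

This grid has 3 × 3 = 9 combinations, so nine models are evaluated. The cost grows multiplicatively with each added hyperparameter, which is the main drawback of the method.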
In random search, a discrete set of values to explore for every hyperparameter is not provided. Instead, a statistical distribution is provided for each hyperparameter from which values may be randomly sampled.
One of the most prominent reasons to use random search over grid search is that in most cases, not all hyperparameters are equally important. In fact, for most datasets only a few of the hyperparameters really matter, but which ones matter differs from dataset to dataset.
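Random search can be sketched in much the same way as an exhaustive search, except that each trial samples every hyperparameter from a distribution. In the sketch below the distributions (log-uniform for the learning rate, uniform for dropout) and the toy `validation_loss`, which depends strongly on the learning rate and only weakly on dropout, are assumptions made for illustration.

```python
import math
import random

def validation_loss(learning_rate, dropout):
    # Toy stand-in for training and evaluating a model (assumption):
    # the learning rate matters a lot, dropout only slightly.
    return (math.log10(learning_rate) + 2.0) ** 2 + 0.01 * (dropout - 0.3) ** 2

def random_search(objective, samplers, n_trials, rng):
    """Sample each hyperparameter from its distribution n_trials times
    and return the sampled combination with the lowest score."""
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: sample(rng) for name, sample in samplers.items()}
        score = objective(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

samplers = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, -1),  # log-uniform
    "dropout": lambda rng: rng.uniform(0.2, 0.5),            # uniform
}
best_params, best_score = random_search(
    validation_loss, samplers, n_trials=50, rng=random.Random(0)
)
```

Because every trial draws a fresh learning rate, 50 trials explore 50 distinct values of the hyperparameter that matters most, whereas a grid of the same size would revisit only a handful of them.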
Bayesian optimization belongs to a class of sequential model-based optimization (SMBO) algorithms which make it possible to use the results of our previous iteration to improve the sampling method used in the next experiment.
We start by defining a model constructed with hyperparameters λ which, after training, is scored v according to an evaluation metric. After that, the previously evaluated hyperparameter values are used to compute a posterior expectation over the hyperparameter space. The hyperparameter values that look best under this posterior expectation are then chosen as the next model candidate.
The process is repeated iteratively until converging to an optimum.
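The loop described above can be sketched in plain Python for a single hyperparameter. Everything specific in this sketch is an assumption: the surrogate is a zero-mean Gaussian process with a squared-exponential kernel, the next candidate is chosen by a lower confidence bound (posterior mean minus `kappa` times the posterior standard deviation), and `objective` is a toy stand-in for training a model with hyperparameter λ (here the base-10 log of a learning rate) and returning a loss v to be minimized. Practical libraries use more refined surrogates and acquisition functions.

```python
import math

def rbf(a, b, length_scale=0.5):
    """Squared-exponential kernel between two scalar hyperparameter values."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Posterior mean and standard deviation of the surrogate at x_new,
    given the hyperparameter values xs already scored as ys."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * a for k, a in zip(k_star, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(k * b for k, b in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 0.0))

def objective(log_lr):
    # Toy stand-in for "train with hyperparameter lambda, return score v".
    return (log_lr + 1.3) ** 2

def bayesian_optimize(objective, candidates, initial, n_iter, kappa=2.0):
    xs = list(initial)
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        def lower_confidence_bound(x):
            mean, std = gp_posterior(xs, ys, x)
            return mean - kappa * std
        untried = [c for c in candidates if c not in xs]
        x_next = min(untried, key=lower_confidence_bound)  # next candidate
        xs.append(x_next)
        ys.append(objective(x_next))                       # score it
    best = min(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], xs

candidates = [-4.0 + 0.1 * i for i in range(41)]   # log10(lr) from -4 to 0
best_x, best_y, tried = bayesian_optimize(
    objective, candidates, initial=[-4.0, 0.0], n_iter=10
)
```

Each iteration refits the surrogate to all results so far, so later candidates are informed by earlier evaluations. This is what distinguishes the approach from grid and random search, where trials are independent of one another.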