What is the vanishing gradient problem in RNNs?
The vanishing gradient problem arises when you train neural networks with gradient-based methods such as backpropagation. It makes it difficult to learn and tune the parameters of the earlier layers in the network.
This is one type of unstable behavior that you could encounter when you are training a deep neural network.
The vanishing gradient problem is essentially a situation in which a deep multilayer feed-forward network or a recurrent neural network (RNN) does not have the ability to propagate useful gradient information from the output end of the model back to the layers near the input end of the model.
As a result, models with many layers may be unable to learn on a given dataset, or may converge prematurely to a substandard solution.
As the backpropagation algorithm moves backward from the output layer to the input layer, the gradients tend to shrink, becoming smaller and smaller until they approach zero. This leaves the weights of the initial (lower) layers practically unchanged, and gradient descent never ends up converging to the optimum.
A vanishing gradient does not necessarily mean that the gradient vector is all zeros (except in cases of numerical underflow). It means that the gradients are minuscule, which makes learning very slow.
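You can see this shrinking effect with a few lines of NumPy. The sketch below is purely illustrative: it assumes a hypothetical chain of 50 sigmoid units and multiplies their local derivatives (each at most 0.25) the way backpropagation would:

```python
import numpy as np

np.random.seed(0)
depth = 50
pre_activations = np.random.randn(depth)        # arbitrary pre-activation values

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The sigmoid derivative s'(x) = s(x) * (1 - s(x)) is at most 0.25.
local_grads = sigmoid(pre_activations) * (1.0 - sigmoid(pre_activations))

grad = 1.0                                      # gradient arriving from the loss
for layer, g in enumerate(local_grads[::-1], start=1):
    grad *= g                                   # chain rule: multiply local derivatives
    if layer % 10 == 0:
        print(f"after {layer} layers: gradient magnitude ~ {grad:.3e}")
```

After a few dozen layers the magnitude is vanishingly small, which is exactly the slow-learning regime described above.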

How do you know if your model is suffering from the vanishing gradient problem?
Here are some signs that indicate your model may be suffering from the vanishing gradient problem (a short gradient-monitoring sketch follows the list):
- The parameters of the higher (later) layers change significantly, while the parameters of the lower (earlier) layers barely change or do not change at all.
- The model weights can shrink to zero during training.
- The model learns very slowly, and training may stagnate at a very early stage after only a few iterations.
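One practical way to check for these symptoms is to log the gradient norm of each layer after a backward pass. The sketch below is a minimal PyTorch example; the architecture and data are made up purely for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical deep feed-forward model with sigmoid activations (prone to vanishing gradients).
layers = []
for _ in range(10):
    layers += [nn.Linear(64, 64), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(64, 1))

# Dummy batch, purely for illustration.
x = torch.randn(32, 64)
y = torch.randn(32, 1)

loss = nn.MSELoss()(model(x), y)
loss.backward()

# Print the gradient norm of every weight matrix, from the input side to the output side.
for name, param in model.named_parameters():
    if param.grad is not None and "weight" in name:
        print(f"{name:20s} grad norm = {param.grad.norm().item():.3e}")

# If the norms near the input are orders of magnitude smaller than those near the
# output, the network is likely suffering from vanishing gradients.
```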
What causes the vanishing gradient problem?
The vanishing gradient problem is caused by the fact that, during backpropagation, the gradient of the early layers (the layers nearest to the input) is obtained by multiplying together the gradients of the later layers (the layers nearest to the output). Therefore, if the gradients of the later layers are less than one, their product shrinks at a particularly rapid pace.
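In symbols, for a feed-forward network with layer activations a^(1), …, a^(n) (generic notation introduced here only for illustration), the chain rule expresses the gradient reaching the first layer's weights as a product of per-layer terms:

```latex
\frac{\partial L}{\partial W^{(1)}}
  = \frac{\partial L}{\partial a^{(n)}}
    \left( \prod_{k=2}^{n} \frac{\partial a^{(k)}}{\partial a^{(k-1)}} \right)
    \frac{\partial a^{(1)}}{\partial W^{(1)}}
```

If each factor has magnitude below one (as with saturated sigmoid or tanh units), the product, and hence the gradient, decays roughly exponentially with depth.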
Why is the vanishing gradient problem significant?
The vanishing gradient problem causes gradients to shrink, and if a gradient is small, the weights and biases of the initial layers cannot be updated effectively at each training iteration.
These initial layers are vital for recognizing the core elements of the input data, so if their weights and biases are not updated properly, the entire network can end up inaccurate.
How do you overcome the vanishing gradient problem?
Here are some methods that have been proposed to overcome the vanishing gradient problem:
- Residual neural networks (ResNets)
- Multi-level hierarchy
- Long short term memory (LSTM)
- Faster hardware
- ReLU
- Batch normalization
Residual neural networks (ResNets)
This is one of the most effective techniques for overcoming the vanishing gradient problem. Before ResNets, a deeper network would often show higher training error than a shallower one.
The backpropagation algorithm updates each weight of the neural network so that it takes a step in the direction along which the loss decreases. This direction is given by the gradient of the loss with respect to that weight.
You can use the chain rule to find this gradient for each weight: multiply the local gradient by the gradient flowing in from the layer ahead.
As the gradient flows backward toward the initial layers, it keeps getting multiplied by each local gradient. The gradient therefore keeps shrinking, the updates to the initial layers become tiny, and training time increases substantially. This problem would disappear if the local gradient were equal to 1.
This can be achieved by using the identity function, whose derivative is always 1. With a local gradient of 1, the backpropagated gradient does not decrease in value.
The ResNet architecture largely prevents the vanishing gradient problem. Its skip connections act as gradient superhighways that let the gradient flow unhindered, so gradients can propagate back to the earliest layers before they are attenuated to small or zero values.
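A minimal sketch of this idea in PyTorch is shown below. It is a simplified residual block, not the full ResNet architecture, and the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A simplified residual block: output = F(x) + x (the skip connection)."""

    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The identity term "+ x" gives the backward pass a path whose local
        # gradient is 1, so the gradient can flow to earlier layers unhindered.
        return self.body(x) + x

# Stack many residual blocks; gradients still reach the first block through the skips.
model = nn.Sequential(*[ResidualBlock(64) for _ in range(20)], nn.Linear(64, 1))
out = model(torch.randn(8, 64))
```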
ReLU
Here, we replace the typical sigmoidal activation functions used for node outputs with a new function: f(x) = max(0, x). This activation saturates in only one direction and is therefore more resilient to vanishing gradients.
ReLU is widely used because it maps x to max(0, x): it does not squash positive inputs into a small range, although it does clamp all negative inputs to zero.
However, ReLU is not always the best option for the intermediate layers of the network. There is the problem of dying ReLUs, in which some neurons effectively die out: they output 0 for every input as training progresses.
Some alternatives to ReLU that also mitigate the vanishing gradient problem when used as activations for intermediate layers include Leaky ReLU (LReLU), PReLU, ELU, and SELU.
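As an illustration, swapping the activation in a deep stack is a one-line change in PyTorch. The depth and layer widths below are arbitrary and used only for this sketch:

```python
import torch.nn as nn

def make_mlp(depth: int = 10, width: int = 64, activation: type = nn.ReLU) -> nn.Sequential:
    """Builds a deep MLP; pass nn.Sigmoid, nn.ReLU, nn.LeakyReLU, nn.ELU, etc."""
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), activation()]
    layers.append(nn.Linear(width, 1))
    return nn.Sequential(*layers)

sigmoid_net = make_mlp(activation=nn.Sigmoid)    # prone to vanishing gradients
relu_net = make_mlp(activation=nn.ReLU)          # saturates only for negative inputs
leaky_net = make_mlp(activation=nn.LeakyReLU)    # also avoids dying ReLUs
```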
Multi-level hierarchy
Multi-level hierarchy involves pre-training one layer at a time, often with an unsupervised objective, and then fine-tuning the whole network with backpropagation.
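A highly simplified sketch of greedy layer-wise pre-training is shown below. It uses per-layer autoencoder objectives as the pre-training task; the layer widths, loop lengths, and data are hypothetical:

```python
import torch
import torch.nn as nn

widths = [64, 32, 16]                       # hypothetical encoder layer widths
x = torch.randn(256, widths[0])             # dummy data, for illustration only

# 1) Pre-train one layer at a time as a small autoencoder.
encoder_layers = []
inputs = x
for in_dim, out_dim in zip(widths[:-1], widths[1:]):
    enc = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
    dec = nn.Linear(out_dim, in_dim)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(100):                    # short pre-training loop
        opt.zero_grad()
        loss = nn.MSELoss()(dec(enc(inputs)), inputs)
        loss.backward()
        opt.step()
    encoder_layers.append(enc)
    inputs = enc(inputs).detach()           # feed this layer's codes to the next layer

# 2) Stack the pre-trained layers, add an output head, and fine-tune end to end.
model = nn.Sequential(*encoder_layers, nn.Linear(widths[-1], 1))
y = torch.randn(256, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for _ in range(100):
    opt.zero_grad()
    loss = nn.MSELoss()(model(x), y)
    loss.backward()
    opt.step()
```

Because each layer starts from weights that already capture structure in its inputs, the final backpropagation pass has far less work to do through the full depth.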
Long Short Term Memory
Long Short Term Memory (LSTM) networks were designed specifically to prevent the vanishing gradient problem, which they achieve through the Constant Error Carousel (CEC). Even in an LSTM, however, gradients can still vanish; they just do so at a far slower pace than in regular recurrent neural networks.
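In practice, replacing a vanilla recurrent layer with an LSTM is often a drop-in change. The sketch below shows both in PyTorch; the input size, hidden size, and sequence dimensions are arbitrary:

```python
import torch
import torch.nn as nn

batch, seq_len, input_size, hidden_size = 8, 100, 32, 64
x = torch.randn(batch, seq_len, input_size)      # dummy sequence batch

# Vanilla RNN: gradients shrink quickly over long sequences.
rnn = nn.RNN(input_size, hidden_size, batch_first=True)
rnn_out, _ = rnn(x)

# LSTM: gating and the cell state (the Constant Error Carousel) preserve gradients
# over many more time steps.
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
lstm_out, (h_n, c_n) = lstm(x)
```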
Batch normalization
The internal covariate shift problem contributes to unstable gradients, including vanishing gradients. Batch normalization helps address this: it normalizes the activations of each layer, letting every layer learn on a more stable distribution of inputs and thereby accelerating the training of the network.
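A minimal sketch in PyTorch: insert a normalization layer after each linear layer. The depth and widths here are arbitrary, chosen only for illustration:

```python
import torch.nn as nn

def make_bn_mlp(depth: int = 10, width: int = 64) -> nn.Sequential:
    """Deep MLP with batch normalization after every linear layer."""
    layers = []
    for _ in range(depth):
        layers += [
            nn.Linear(width, width),
            nn.BatchNorm1d(width),   # normalize activations to a stable distribution
            nn.ReLU(),
        ]
    layers.append(nn.Linear(width, 1))
    return nn.Sequential(*layers)

model = make_bn_mlp()
```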