What is stochastic gradient descent?
Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g., differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) with an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems, this reduces the computational burden, achieving faster iterations in exchange for a lower convergence rate.
Error functions are rarely as simple as a typical parabola. Most often they have lots of hills and valleys, like the function pictured here. In this graph, if true gradient descent started on the left side, it would stop at the left valley because no matter which direction you travel from that point, you must travel upwards. Such a point is known as a local minimum. However, there exists another point in the graph that is lower. The lowest point in the entire graph is the global minimum, which is what stochastic gradient descent attempts to find.
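To make this concrete, here is a minimal sketch of plain gradient descent getting trapped in a local minimum. The function f(x) = x⁴ − 3x² + x and all names (`grad`, `gradient_descent`) are purely illustrative choices, not from the original text: f has a local minimum near x ≈ 1.13 and a lower global minimum near x ≈ −1.30, so the result depends entirely on where we start.

```python
def grad(x):
    # derivative of the illustrative function f(x) = x^4 - 3x^2 + x
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=10_000):
    """Plain (deterministic) gradient descent from starting point x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Starting on the right converges to the higher local minimum;
# starting on the left finds the lower (global) minimum.
print(gradient_descent(2.0))   # ends near the local minimum (~1.13)
print(gradient_descent(-2.0))  # ends near the global minimum (~-1.30)
```

Deterministic gradient descent has no mechanism to escape the basin it starts in; the stochastic noise in SGD is one thing that can occasionally knock it out of a shallow valley.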
What is the difference between Batch Gradient Descent and Stochastic Gradient Descent?
Gradient, in plain terms, means the slope or slant of a surface. So gradient descent literally means descending a slope to reach the lowest point on that surface. In regression problems, the objective of gradient descent is typically to minimize the sum of squared residuals. It is an iterative algorithm that starts from a random point on a function and travels down its slope in steps until it reaches the lowest point of that function.
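As a minimal sketch, here is batch gradient descent fitting a straight line by minimizing the (mean) squared residuals. The function and parameter names (`batch_gradient_descent`, `lr`, `epochs`) are illustrative assumptions, not from any particular library; note that every update uses all of the data.

```python
def batch_gradient_descent(xs, ys, lr=0.05, epochs=1000):
    """Fit y = slope*x + intercept by full-batch gradient descent."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # gradients of (1/n) * sum((y - (slope*x + intercept))^2),
        # computed over the ENTIRE data set on every iteration
        grad_slope = (-2 / n) * sum(x * (y - (slope * x + intercept))
                                    for x, y in zip(xs, ys))
        grad_intercept = (-2 / n) * sum(y - (slope * x + intercept)
                                        for x, y in zip(xs, ys))
        slope -= lr * grad_slope
        intercept -= lr * grad_intercept
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]                  # exactly y = 2x + 1
print(batch_gradient_descent(xs, ys))  # approaches (2.0, 1.0)
```

The two `sum(...)` calls over `zip(xs, ys)` are exactly the per-iteration cost the next paragraph counts: every parameter update touches every data point.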
This algorithm is useful in cases where the optimal points cannot be found by equating the slope of the function to 0.
However, there are a few downsides to the gradient descent algorithm. Say we have 10,000 data points and 10 features. The sum of squared residuals has as many terms as there are data points, so 10,000 terms in our case. We need to compute the derivative of this function with respect to each of the features, so in effect we do 10,000 × 10 = 100,000 computations per iteration. With a typical 1,000 iterations, that is 100,000 × 1,000 = 100,000,000 computations to complete the algorithm. That is a lot of overhead, and hence gradient descent is slow on huge datasets.
Stochastic Gradient Descent
Stochastic gradient descent addresses the main bottleneck of gradient descent: the use of the whole training set to calculate the gradients at each step. SGD is stochastic in nature, i.e., it picks a random instance of the training data at each step and computes the gradient from that single example, which makes each step much faster since there is far less data to process at a time, unlike batch GD.
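A minimal sketch of this idea, for the same line-fitting problem as above (names such as `sgd` and `seed` are illustrative assumptions): each update uses one randomly chosen example instead of the whole data set.

```python
import random

def sgd(xs, ys, lr=0.01, steps=20_000, seed=0):
    """Single-example stochastic gradient descent for y = slope*x + b."""
    rng = random.Random(seed)
    slope, intercept = 0.0, 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))       # pick ONE random instance
        x, y = xs[i], ys[i]
        error = y - (slope * x + intercept)
        slope += lr * 2 * error * x      # gradient of one squared residual
        intercept += lr * 2 * error
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]   # exactly y = 2x + 1
print(sgd(xs, ys))      # noisy path, but ends close to (2.0, 1.0)
```

Each step now costs a constant amount of work regardless of how many data points there are, which is exactly why SGD scales to the 10,000-point scenario above where batch GD becomes expensive.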
There is a downside to the stochastic nature of stochastic gradient descent. Once it gets close to the minimum, it does not settle down; instead, it bounces around, which gives us good but not optimal values for the model parameters. This can be mitigated by gradually reducing the learning rate at each step, which dampens the bouncing and lets SGD settle near a minimum over time.
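A minimal sketch of that remedy: the same single-example SGD, but with a decaying learning rate so the bouncing shrinks as training progresses. The schedule lr = lr0 / (1 + decay · step) is one common choice, not the only one, and all names here are illustrative assumptions.

```python
import random

def sgd_with_decay(xs, ys, lr0=0.02, decay=0.001, steps=20_000, seed=0):
    """Single-example SGD with a 1/(1 + decay*t) learning-rate schedule."""
    rng = random.Random(seed)
    slope, intercept = 0.0, 0.0
    for step in range(steps):
        lr = lr0 / (1 + decay * step)    # step size shrinks over time
        i = rng.randrange(len(xs))
        x, y = xs[i], ys[i]
        error = y - (slope * x + intercept)
        slope += lr * 2 * error * x
        intercept += lr * 2 * error
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]           # exactly y = 2x + 1
print(sgd_with_decay(xs, ys))   # settles close to (2.0, 1.0)
```

Early on, the large steps make fast progress; later, the small steps keep the parameters from bouncing far away from the minimum.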
What are the advantages of stochastic gradient descent?
Suppose you have made an app and want to improve it by taking feedback from 100 customers. You can do it in two ways. First, you can give the app to the first customer and take his feedback then to the second one, then the third, and so on. After collecting feedback from all of them you can improve your app. But you can also improve the app as soon as you get feedback from the first customer. Then you give it to the second one and you improve again before giving it to the third one. Notice that in this way you are improving your app at a much faster rate and can reach an optimal point much earlier.
Hopefully, you can tell that the first process is the Vanilla Gradient Descent and the second one is SGD.
What are the disadvantages of stochastic gradient descent?
SGD is much faster, but its convergence path is noisier than that of original gradient descent. This is because at each step it computes not the actual gradient but an approximation of it from a single example, so we see a lot of fluctuation in the cost. Still, for large datasets it is often the better choice.
We can see the noise of SGD in the above contour plot. Note that vanilla GD takes fewer updates, but each update happens only after one whole epoch. SGD takes many more update steps, but it needs fewer epochs, i.e., the number of passes through all the examples is smaller, which makes it a much faster process overall.
As you can see in the plot, there is a third variant of gradient descent known as mini-batch gradient descent. It combines the speed of SGD with the accuracy of batch GD. In this case, we take a fixed number of training examples at a time (known as the batch size) and compute the cost and corresponding gradient. Then we update the weights and continue the same process for the next batch. If batch size = 1, it becomes SGD, and if batch size = m (the full training set), it becomes normal batch GD.
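The mini-batch variant can be sketched as follows, again for the same line-fitting problem (names like `minibatch_gd` and `batch_size` are illustrative assumptions). Setting `batch_size=1` recovers SGD and `batch_size=len(xs)` recovers batch GD.

```python
import random

def minibatch_gd(xs, ys, lr=0.02, epochs=2000, batch_size=2, seed=0):
    """Mini-batch gradient descent for y = slope*x + intercept."""
    rng = random.Random(seed)
    slope, intercept = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)                 # new random order each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # average the gradient over the examples in this mini-batch
            gs = sum(-2 * xs[i] * (ys[i] - (slope * xs[i] + intercept))
                     for i in batch) / len(batch)
            gb = sum(-2 * (ys[i] - (slope * xs[i] + intercept))
                     for i in batch) / len(batch)
            slope -= lr * gs
            intercept -= lr * gb
    return slope, intercept

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]         # exactly y = 2x + 1
print(minibatch_gd(xs, ys))   # converges close to (2.0, 1.0)
```

Averaging over a small batch smooths out the noise of single-example updates while still keeping each update much cheaper than a full pass over the data, which is why mini-batch GD is the default in most deep-learning frameworks.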