What is regularization in machine learning?
In machine learning, regularization is a technique that constrains a model's coefficients, shrinking them towards zero, which discourages the model from becoming overly complex or flexible. It can also improve the reliability, speed, and accuracy of convergence.
By keeping the model simpler, regularization reduces overfitting. It does so by adding extra information, in the form of a constraint or penalty, to the learning problem.
Regularization is essentially a technique that slightly modifies the learning algorithm so that the model generalizes better, that is, performs better on unseen data.
Regularization penalizes complex models by adding a complexity term to the loss, so that more complex models incur a larger loss. In classical models such as linear regression, the penalty applies to the coefficients; in deep learning, it applies to the weight matrices of the layers.
Does regularization increase bias?
Regularization aims to reduce the variance of the estimator by simplifying it. This increases the bias, but the goal is for the reduction in variance to outweigh the added bias, so that the expected error falls. It is especially useful when the problem is ill-posed, for example when the number of parameters exceeds the number of samples.
When regularization is tuned well, it introduces just enough bias to avoid overfitting, no more and no less.
What is overfitting?
The whole point of regularization is to prevent overfitting. But what is overfitting and why should you avoid it?
Overfitting is a situation where the model fits the training data too closely. The model learns not only the genuine patterns in the training data, but also the noise in it, treating that noise as if it were a real concept.
The problem is that an overfitted model largely loses its ability to generalize. It becomes relevant only to the dataset on which it was trained and performs poorly on any other data.
What is the use of regularization in machine learning?
Regularization can significantly reduce the variance of the model without substantially increasing its bias.
The tuning parameter λ controls the trade-off between bias and variance. As λ increases, the coefficients shrink, which reduces the variance. Increasing λ is useful up to a point, because it reduces variance and prevents overfitting without losing any important properties of the data. Beyond a certain threshold, however, the model starts to lose vital properties, which introduces bias and causes underfitting. At that point the model is no longer learning the noise in the training data, as it would when overfitting, but it also cannot model the training data well or generalize to new data.
You need to pick the value of λ with great care.
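The effect of λ can be seen in a minimal sketch. It assumes a one-feature model without an intercept and a squared-coefficient penalty, for which the penalized least-squares slope has a simple closed form. The function names and the toy data are made up for illustration only.

```python
def ridge_slope(xs, ys, lam):
    """Slope minimizing sum((y - b*x)^2) + lam * b^2 for a no-intercept model."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(xs, ys, slope):
    """Mean squared error of the line y = slope * x on the given points."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy training data (hypothetical), roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

lambdas = [0.0, 1.0, 10.0, 100.0]
slopes = [ridge_slope(xs, ys, lam) for lam in lambdas]
train_errors = [mse(xs, ys, b) for b in slopes]

for lam, b, err in zip(lambdas, slopes, train_errors):
    print(f"lambda={lam:6.1f}  slope={b:.3f}  training MSE={err:.3f}")
```

As λ grows, the slope shrinks towards zero and the training error rises. In practice λ is usually chosen by comparing error on held-out data rather than on the training set.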
How does regularization work?
Regularization works by adding a penalty (a complexity term or shrinkage term) to the residual sum of squares (RSS), the loss that the model minimizes.
Take the linear regression equation, in which Y is the response (the dependent variable) whose relation to the predictors is being learned.
Y ≈ β0 + β1X1 + β2X2 + … + βpXp
X1, X2, …, Xp are the independent features or predictors for Y, and β0, β1, …, βp are the coefficient estimates for the different predictors (X), which describe the weight (the magnitude) attached to each feature.
The fitting procedure involves a loss function, the residual sum of squares (RSS). The coefficients are chosen so that they minimize this loss function.
The coefficients are adjusted based on your training data. If the training data contains noise, the estimated coefficients will not generalize well to future data. That is where regularization comes into play, shrinking those learned estimates towards zero.
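For reference, the RSS can be written out explicitly. The notation below is an assumption added for illustration: n is the number of training samples, y_i is the observed response for sample i, and x_ij is the value of predictor j for sample i.

```latex
\[
\mathrm{RSS} = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2}
\]
```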
What are the types of regularization?
Common regularization techniques include dropout, batch normalization, data augmentation, early stopping, L2 regularization (Ridge), and L1 regularization (Lasso).
In dropout, a random subset of activations is dropped (set to zero) at each training step, which leads to smaller networks that can be trained more effectively. Activations are the outputs a layer produces from its inputs and weights. If a random portion of the activations is removed at every layer, then no single activation can memorize the input pattern.
The network cannot put extra weight on any specific activation, since there is no way to know whether that activation will survive the next step. This reduces overfitting to the training inputs.
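A minimal sketch of dropout applied to a list of activations follows. It assumes the common "inverted dropout" variant, in which surviving activations are rescaled by 1 / keep probability so that their expected value is unchanged. That rescaling, the function name, and the parameters are illustrative assumptions, not something the text above specifies.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob; rescale the survivors."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    # Each activation is kept independently with probability keep_prob.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)          # fixed seed so the run is repeatable
activations = [1.0] * 10_000    # toy activations, all equal to 1
dropped = dropout(activations, drop_prob=0.5, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f"fraction zeroed: {zero_fraction:.3f}, mean after dropout: {mean_after:.3f}")
```

Dropout is applied only during training; at inference time all activations are kept.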
Batch normalization normalizes the output of the previous activation layer by subtracting the batch mean and dividing by the batch standard deviation. It then introduces two trainable parameters per layer: the normalized output is multiplied by gamma and shifted by beta. The values of gamma and beta are learned during training along with the network's other weights.
It allows higher learning rates, can improve accuracy, and is intended to reduce internal covariate shift by weakening the coupling between the parameters of earlier layers and those of later layers, so that layers can adapt more independently and training speeds up.
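The normalization step can be sketched for a single feature over one batch. This sketch uses gamma as a scale and beta as a shift, as in the standard formulation, and covers the training-time computation only. The small constant eps, added for numerical stability, and the function name are assumptions for illustration.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch, then scale by gamma and shift by beta."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


batch = [2.0, 4.0, 6.0, 8.0]                      # toy batch of one feature
normalized = batch_norm(batch)                    # gamma=1, beta=0
scaled = batch_norm(batch, gamma=2.0, beta=1.0)   # gamma and beta would be learned

print([round(v, 3) for v in normalized])
print([round(v, 3) for v in scaled])
```

With gamma = 1 and beta = 0 the output has mean close to 0 and variance close to 1; other values let the network restore whatever scale and offset suit the next layer.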
Data augmentation involves adding slightly edited versions of existing data or using existing data to create synthetic data, thus increasing the actual amount of data available.
It helps deep learning models become more robust and precise by generating variations of the data that the model could encounter in the real world.
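As an illustration, here is a minimal sketch that augments a tiny grayscale image stored as a list of rows. The two transformations, a horizontal flip and small Gaussian pixel noise, along with all names and parameter values, are assumptions chosen for the example; real pipelines pick transformations suited to the data.

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right, returning a new list of rows."""
    return [row[::-1] for row in image]


def add_noise(image, scale, rng):
    """Add zero-mean Gaussian noise with standard deviation `scale` to each pixel."""
    return [[pixel + rng.gauss(0.0, scale) for pixel in row] for row in image]


def augment(image, rng, noise_scale=0.01):
    """Return a randomly flipped, slightly noisy copy of the image."""
    base = horizontal_flip(image) if rng.random() < 0.5 else [row[:] for row in image]
    return add_noise(base, noise_scale, rng)


image = [[0.0, 0.5, 1.0],
         [0.2, 0.4, 0.6]]   # toy 2x3 image
rng = random.Random(42)
augmented = [augment(image, rng) for _ in range(4)]  # four synthetic variants

for variant in augmented:
    print([[round(p, 3) for p in row] for row in variant])
```

The original image is left untouched; each call produces a new variant that can be added to the training set.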
In early stopping, you hold out part of the training set as a validation set and monitor the model's performance on it during training. If performance on the validation set starts to worsen, training is stopped.
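This validation-based rule is commonly known as early stopping. The sketch below applies it to a list of per-epoch validation losses. The `patience` parameter, which tolerates a number of non-improving epochs before stopping, is an assumption added for illustration; `patience=0` corresponds to stopping as soon as the validation loss fails to improve. The loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=0):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    bad_epochs = 0  # consecutive epochs without a new best validation loss
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs > patience:
                return epoch
    return len(val_losses) - 1  # never triggered: training ran to the end


# Hypothetical validation losses recorded after each epoch.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.58, 0.61, 0.65]

stop_immediately = early_stopping_epoch(val_losses, patience=0)
stop_with_patience = early_stopping_epoch(val_losses, patience=1)
print(stop_immediately, stop_with_patience)
```

A small patience value avoids stopping on a single noisy epoch, at the cost of a few extra epochs of training.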
A regression model that uses L2 regularization is called Ridge Regression. In Ridge regression, the sum of the squared magnitudes of the coefficients, scaled by λ, is added to the loss function as the penalty term.
A regression model that uses L1 regularization is called Lasso Regression. Lasso stands for Least Absolute Shrinkage and Selection Operator. It adds the sum of the absolute values of the coefficients, scaled by λ, to the loss function as the penalty term, which can shrink some coefficients to exactly zero.
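Written out, the Ridge objective adds the squared-coefficient penalty to the RSS. The sketch below follows the usual convention, assumed here, that the intercept β0 is not penalized, so the sum starts at j = 1.

```latex
\[
\hat{\beta}^{\text{ridge}}
  = \arg\min_{\beta} \Bigl\{ \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^{2} \Bigr\}
\]
```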
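The practical difference between the two penalties can be sketched for a one-feature model without an intercept, where both estimates have closed forms. The setup, the function names, and the toy data are assumptions for illustration. The Lasso solution uses the soft-thresholding operator, which is what allows a coefficient to become exactly zero.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t, returning exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_slope(xs, ys, lam):
    """Slope minimizing sum((y - b*x)^2) + lam * |b| for a no-intercept model."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam / 2.0) / sxx


def ridge_slope(xs, ys, lam):
    """Slope minimizing sum((y - b*x)^2) + lam * b^2 for a no-intercept model."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Toy data (hypothetical) in which x is only weakly related to y.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0.3, -0.1, 0.2, 0.1]

for lam in [0.0, 1.0, 4.0]:
    print(f"lambda={lam:.1f}  lasso={lasso_slope(xs, ys, lam):.4f}  "
          f"ridge={ridge_slope(xs, ys, lam):.4f}")
```

With a large enough λ the Lasso slope is exactly zero, effectively removing the feature, while the Ridge slope only gets smaller without reaching zero.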