What is dimensionality reduction?
Dimensionality reduction is the act of reducing the number of input variables in the training data for machine learning models.
Data with high dimensionality may have hundreds, thousands, or millions of input variables. Data with fewer input variables or dimensions could be handled by machine learning models that have a simpler structure and fewer parameters (degrees of freedom).
Reducing the dimensionality of the data lets you use simpler models, which tend to generalize better and are therefore more desirable. If a machine learning model has too many degrees of freedom, it tends to overfit the training dataset, and such a model does not perform well on new data.
What is the importance of dimensionality reduction?
Learning from high dimensional data can be computationally very expensive. Such high dimensional datasets tend not to be randomly distributed and usually contain spurious correlations. Models trained on such data tend to perform rather well on the training data, but do not hold up against test data.
Dimensionality reduction gets rid of the irrelevant features (or variables) from the data that would otherwise make the model less accurate. It helps your predictive models avoid the ‘Curse of Dimensionality’, while preserving most of the relevant features and information in the data.
Reducing the dimensions of the data also makes data visualization easier and saves on time as well as storage space.
Dimensionality reduction can also help enhance the interpretation of the parameters of your machine learning model by eliminating multicollinearity.
What is the ‘Curse of Dimensionality’?
If the number of variables or features is much higher than the number of observations in a dataset, some algorithms would not be able to train very effective models.
This phenomenon is known as the ‘Curse of Dimensionality’.
What are the components of dimensionality reduction?
The components of dimensionality reduction are:
Feature selection
This involves identifying a smaller subset of the original features that can be used to model the problem.
The types of feature selection methods are:
- Filter
- Wrapper
- Embedded
Feature extraction
Feature extraction involves reducing the original set of raw data into manageable groups for the purpose of processing. It is most often applied to text and image data: the most important features are extracted and processed instead of the entire raw dataset.
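As an illustration, here is a minimal sketch of feature extraction from raw text using scikit-learn's TfidfVectorizer; the documents below are placeholders, and `get_feature_names_out` assumes a recent scikit-learn version.

```python
# Minimal sketch: extract numeric features from raw text documents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "dimensionality reduction removes redundant features",
    "feature extraction turns raw text into numeric vectors",
    "models train faster on fewer input variables",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)           # sparse matrix: documents x extracted terms

print(X.shape)
print(vectorizer.get_feature_names_out()[:5])  # a few of the extracted terms
```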
What are the ways of reducing dimensionality?
Here are some dimensionality reduction methods:
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Generalized Discriminant Analysis (GDA)
Principal Component Analysis (PCA)
This is a linear dimensionality reduction technique which transforms a set of ‘p’ correlated features into a smaller number ‘k’ of uncorrelated features (k < p). These uncorrelated variables are known as principal components. The method retains as much variation from the original dataset as possible. It is an unsupervised machine learning algorithm.
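A minimal sketch of PCA with scikit-learn, assuming a synthetic feature matrix stands in for your real data:

```python
# Minimal sketch: reduce p = 8 features to k = 3 principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))           # 100 samples, 8 features
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]       # introduce correlation between two columns

pca = PCA(n_components=3)               # keep 3 uncorrelated components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 3)
print(pca.explained_variance_ratio_)    # variance retained by each component
```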
Linear Discriminant Analysis (LDA)
This technique separates the training instances by their classes. It identifies a linear combination of the input variables that optimizes class separability. It is a supervised machine learning algorithm.
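A minimal sketch of LDA as supervised dimensionality reduction with scikit-learn; the Iris dataset is used purely for illustration.

```python
# Minimal sketch: project 4 features onto 2 class-discriminative directions.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# With 3 classes, LDA can project onto at most 2 discriminant directions.
lda = LinearDiscriminantAnalysis(n_components=2)
X_projected = lda.fit_transform(X, y)   # labels y are required (supervised)

print(X_projected.shape)                # (150, 2)
```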
Generalized Discriminant Analysis (GDA)
This method uses a kernel function. It maps the input vectors into a high-dimensional feature space. The method seeks a projection of the variables into a lower-dimensional space by maximizing the ratio of between-class scatter to within-class scatter.
There are many other dimensionality reduction methods. These include t-distributed Stochastic Neighbor Embedding (t-SNE), Kernel PCA, Factor Analysis (FA), Truncated Singular Value Decomposition (SVD), Multidimensional Scaling (MDS), Isometric mapping (Isomap), Backward Elimination, Forward Selection, etc.
What are the types of feature selection for dimensionality reduction?
The types of feature selection for dimensionality reduction are:
- Recursive Feature Elimination
- Genetic Feature Selection
- Sequential Forward Selection
Recursive Feature Elimination
Recursive Feature Elimination (RFE) is a wrapper-style feature selection algorithm. This means that a separate machine learning algorithm sits at the core of the method, is wrapped by RFE, and is used to help select features.
Technically, it is a wrapper-type feature selection algorithm that also makes use of filter-based feature selection internally. It works by searching for a subset of features: starting with all the features in the training dataset, it eliminates features until the desired number remains.
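A minimal sketch of RFE with scikit-learn, wrapping a logistic regression model; the synthetic dataset is only illustrative.

```python
# Minimal sketch: RFE repeatedly fits the wrapped model and drops the weakest feature.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

selector = RFE(estimator=LogisticRegression(max_iter=1000),
               n_features_to_select=4)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 = selected; higher ranks were eliminated earlier
```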
Genetic Feature Selection
Genetic algorithms (GA) are inspired by Charles Darwin’s theory of natural selection in which only the fittest individuals are preserved over different generations. They mimic the forces of natural selection to find the optimal values of a function.
Since variables work in groups, for every possible solution of the genetic algorithm, the selected variables are considered as a whole. The algorithm won’t rank variables individually against the target.
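Here is a rough from-scratch sketch of this idea (not a library API): each candidate solution is a binary mask over all features, scored as a whole via cross-validation, then evolved through selection, crossover, and mutation. The population size, generation count, and mutation rate are arbitrary choices for illustration.

```python
# Rough sketch of genetic feature selection: evolve binary feature masks.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=12,
                           n_informative=4, random_state=0)
n_features = X.shape[1]

def fitness(mask):
    # Score the selected variables as a whole, not individually.
    if mask.sum() == 0:
        return 0.0
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, mask.astype(bool)], y, cv=3).mean()

# Initial population of random feature masks.
population = rng.integers(0, 2, size=(20, n_features))

for generation in range(10):
    scores = np.array([fitness(ind) for ind in population])
    # Selection: keep the fittest half as parents.
    parents = population[np.argsort(scores)[-10:]]
    # Crossover: combine random pairs of parents at a random cut point.
    children = []
    for _ in range(10):
        a, b = parents[rng.integers(0, 10, size=2)]
        cut = rng.integers(1, n_features)
        children.append(np.concatenate([a[:cut], b[cut:]]))
    children = np.array(children)
    # Mutation: flip each bit with small probability.
    flip = rng.random(children.shape) < 0.05
    children = np.where(flip, 1 - children, children)
    population = np.vstack([parents, children])

best = population[np.argmax([fitness(ind) for ind in population])]
print("selected features:", np.flatnonzero(best))
```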
Sequential Forward Selection
In sequential forward selection, the best single feature is selected first. Pairs are then formed by combining this best feature with each of the remaining features, and the best pair is selected. Next, triplets are formed by combining the best pair with each of the remaining features, and the best triplet is selected.
This could continue till a predefined number of features is selected.
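A minimal sketch using scikit-learn's SequentialFeatureSelector (available in scikit-learn 0.24 and later); the dataset and estimator are illustrative.

```python
# Minimal sketch: greedily grow the selected feature set one feature at a time.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

# direction="forward" starts from an empty set and adds, at each step, the feature
# that most improves cross-validated accuracy, up to n_features_to_select.
sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                n_features_to_select=3,
                                direction="forward")
sfs.fit(X, y)

print(sfs.get_support())   # boolean mask of the 3 selected features
```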
Is dimensionality reduction reversible?
Dimensionality reduction is reversible in autoencoders. These are essentially regular neural networks with a bottleneck layer in the middle. For example, you could have 20 inputs in the first layer, 10 neurons in the middle layer, and another 20 neurons in the last layer. On training such a network, you essentially force it to compress information to 10 neurons and then to decompress it, minimizing the error in the last layer.
If you train such a network with plain backpropagation and a squared-error loss (with linear activations), it ends up learning essentially the same projection as Principal Component Analysis (PCA), which returns uncorrelated features and is not always very effective. Using a more sophisticated training algorithm can push the autoencoder towards something closer to Independent Component Analysis (ICA), which returns statistically independent features. Such a training algorithm looks for low complexity neural networks with high generalization capability; ICA then emerges essentially as a byproduct of regularization.
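As a rough sketch of the bottleneck idea (20 inputs, 10 hidden units, 20 outputs), scikit-learn's MLPRegressor can be trained to reproduce its own input; a dedicated deep learning library would normally be used, and the synthetic data here is only for illustration.

```python
# Rough sketch: a linear bottleneck autoencoder trained to reconstruct its input.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))            # 20-dimensional input

# Compress to 10 units in the middle, then decompress back to 20 outputs.
autoencoder = MLPRegressor(hidden_layer_sizes=(10,),
                           activation="identity",
                           max_iter=2000,
                           random_state=0)
autoencoder.fit(X, X)                     # target equals input

X_reconstructed = autoencoder.predict(X)
print("reconstruction MSE:", np.mean((X - X_reconstructed) ** 2))
```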