Imbalanced Dataset

AI and machine learning are changing the technology landscape every day, and both depend on data, and lots of it. Classification is one of the core tasks in machine learning: the model learns to assign each observation to a group (a class label) based on pre-determined characteristics, so that it can yield results or reach a conclusion. Ideally, the observations are spread roughly evenly across the class labels, but in practice they are often skewed, with some classes having far more observations than others.

What is an Imbalanced Dataset?

An imbalanced dataset (also called an unbalanced dataset) is a dataset in which the target classes are skewed, that is, unevenly distributed. One class label has very few observations while the other has a very large number of observations.

For example, a company runs a health check on its website to understand the bounce rate. The analysis shows that, out of 50,000 new visitors this month, only 800 left the website without any engagement. That is a bounce rate of 1.6% (800 / 50,000), which is very small compared to the total footfall.

If we split these observations into two target classes, "Lost Traffic" and "Potential Traffic", the observations in "Lost Traffic" form the minority class and the observations in "Potential Traffic" form the majority class. This classification problem therefore falls under the category of imbalanced datasets.
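
The arithmetic behind this example can be checked in a few lines. The snippet below is only a minimal sketch: the variable names are illustrative, and the labels are generated from the two counts given above rather than from real traffic data.

```python
from collections import Counter

# Figures from the example above: 800 bounced visitors out of 50,000.
total_visitors = 50_000
lost = 800

# Build one label per visitor, then count the labels per class.
labels = ["Lost Traffic"] * lost + ["Potential Traffic"] * (total_visitors - lost)
counts = Counter(labels)

# The minority class is the label with the fewest observations.
minority_class, minority_count = min(counts.items(), key=lambda kv: kv[1])
minority_share = minority_count / total_visitors

print(minority_class, f"{minority_share:.1%}")  # Lost Traffic 1.6%
```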

Dataset imbalance is common in classification problems. In some cases, however, the imbalance is severe, and analyzing or correcting such disparity takes a lot of time.

What is the difference between a balanced dataset & an imbalanced dataset?

Balanced dataset

In a classification problem, if the observations are spread uniformly or equally across the target classes, the dataset is called balanced. A balanced dataset is easier to analyse and process because it needs no modification before it is used to derive results or form a hypothesis.

As a basic example, take a dataset that shows the number of admissions to an MBA course at a public university over 3 years. Because of the fixed intake capacity of the college, the count of students enrolled each year is more or less the same.

Imbalanced dataset

If the number of observations differs widely between the target classes, that is, the distribution is skewed towards one class, the dataset is called imbalanced. Before such data is fed into an ML model, it usually goes through a process that balances the target classes, depending on their characteristics and other attributes. Some imbalance is to be expected in real datasets, and a mild imbalance often has little impact on classification or predictive modeling.

Another simple example of an imbalanced dataset is spam email filtering, where a service tries to separate ‘ham’ (legitimate) mails from ‘spam’ mails.

Data imbalance is categorised into 3 levels, based on the proportion of the dataset that belongs to the minority class.

Degree of Imbalance	The proportion of Minority Class
Mild	20-40% of the data set
Moderate	1-20% of the data set
Extreme	<1% of the data set
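
The table above can be turned into a small lookup function. This is a sketch under stated assumptions: the function name is hypothetical, a share of exactly 20% (which appears in both the Mild and Moderate rows) is treated as Mild, and anything above 40% is given the label "Roughly balanced", which is not part of the table.

```python
def degree_of_imbalance(minority_count, total_count):
    """Map the minority-class share onto the levels in the table above."""
    share = minority_count / total_count
    if share < 0.01:
        return "Extreme"
    if share < 0.20:
        return "Moderate"
    if share <= 0.40:
        # Assumption: exactly 20% is counted as Mild.
        return "Mild"
    # Assumption: this label is not in the table.
    return "Roughly balanced"


# The bounce-rate example: 800 lost visitors out of 50,000.
print(degree_of_imbalance(800, 50_000))  # Moderate
```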

How do you balance an Imbalanced Dataset?

Use the right evaluation metrics

Using the wrong evaluation metric can complicate the classification task and waste time. Different metrics do not necessarily give the same picture of a model, and depending on the degree of imbalance, one metric can be far more informative than another.

If the degree of imbalance between the classes is not high, classification accuracy can be used to evaluate the model.

For example, suppose there are three classes A, B, and C that hold 30%, 35%, and 35% of the samples respectively. In this scenario the imbalance is not severe, so classification accuracy is a reasonable metric to use. On the other hand, if the degree of imbalance is moderate or high, other metrics such as logarithmic loss, F1-score, precision, recall, and the confusion matrix are better choices, because a model can reach high accuracy simply by always predicting the majority class.
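
To see why accuracy alone can mislead on imbalanced data, the sketch below computes accuracy, precision, recall, and F1-score from the four cells of a binary confusion matrix. The function name and the "always predict the majority class" model are hypothetical; the counts reuse the bounce-rate example (800 "Lost Traffic" visitors treated as the positive class, 49,200 "Potential Traffic" visitors as the negative class).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when nothing is predicted positive
    # or when there are no positive samples at all.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical model that labels every visitor as "Potential Traffic":
# it never finds a single lost visitor, yet its accuracy looks excellent.
always_majority = classification_metrics(tp=0, fp=0, fn=800, tn=49_200)
print(always_majority["accuracy"], always_majority["recall"])  # 0.984 0.0
```

Accuracy is 98.4% even though recall and F1-score for the minority class are both zero, which is the situation the alternative metrics are meant to expose.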

Resampling

Resampling is one of the most widely used techniques for restoring the balance between the minority class and the majority class. Under this approach, the samples drawn from each class are adjusted to create a balanced dataset.

Undersampling

Under this technique, the dataset is balanced by reducing the number of samples from the majority (abundant) class until it matches the size of the minority (rare) class.

Let’s take a look at this example: there are 2 classes with 3K and 7K data points, for a total of 10K data points. To balance these classes, we keep only 3K data points from the 7K class (abundant class), selected on the basis of common attributes and properties, and keep all 3K data points from the 3K class (rare class). The balanced dataset then has 6K data points.
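
The 7K/3K example can be sketched as follows. Note the assumption: this sketch picks the majority samples at random, which is the simplest form of undersampling, rather than selecting them by shared attributes as described above. The function name and the placeholder samples are hypothetical.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Randomly keep as many majority samples as there are minority samples."""
    rng = random.Random(seed)
    # Sample without replacement so no majority sample is kept twice.
    kept_majority = rng.sample(majority, k=len(minority))
    return kept_majority, list(minority)


# Placeholder data: 7K abundant-class and 3K rare-class samples.
majority = [("abundant", i) for i in range(7_000)]
minority = [("rare", i) for i in range(3_000)]

kept_majority, kept_minority = random_undersample(majority, minority)
print(len(kept_majority), len(kept_minority))  # 3000 3000
```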

Another way to perform undersampling is to use Tomek Links, which are available in Python. A Tomek Link is a pair of very close instances (cases or observations) that belong to opposite classes. The method removes the majority-class instance of each such pair, which shrinks the majority class and cleans up the boundary between the classes (in the tech world, this is often described as ‘noise reduction’). In simple words, samples that sit right next to a sample of the other class blur the distinction between the class labels, and the less blurred that distinction is, the easier it becomes to train on the data. The Tomek Link method therefore removes these ambiguous majority-class samples to improve data quality and accuracy.
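
The idea can be illustrated with a small pure-Python sketch. It assumes a binary problem, Euclidean distance, and the usual definition of a Tomek Link as two samples of different classes that are each other's nearest neighbour. The function names and the toy points are hypothetical, and the brute-force neighbour search is only suitable for small datasets.

```python
import math


def nearest_neighbor(index, points):
    """Index of the point closest to points[index], excluding itself."""
    return min(
        (j for j in range(len(points)) if j != index),
        key=lambda j: math.dist(points[index], points[j]),
    )


def remove_tomek_links(points, labels, majority_label):
    """Drop majority-class samples that take part in a Tomek Link."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    to_drop = set()
    for i, j in enumerate(nn):
        # Tomek Link: i and j are mutual nearest neighbours with different labels.
        if nn[j] == i and labels[i] != labels[j]:
            to_drop.add(i if labels[i] == majority_label else j)
    keep = [i for i in range(len(points)) if i not in to_drop]
    return [points[i] for i in keep], [labels[i] for i in keep]


# Toy data: the majority point (3.0, 3.0) sits right next to the only minority point.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0), (3.05, 3.0)]
labels = ["maj", "maj", "maj", "maj", "maj", "min"]

clean_points, clean_labels = remove_tomek_links(points, labels, majority_label="maj")
print(len(clean_points), clean_labels.count("min"))  # 5 1
```

Only the majority sample at (3.0, 3.0) is removed; the minority sample and the majority samples that are far from the class boundary are kept.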

Oversampling

Continuing with the same example, under oversampling techniques we increase the size of the minority (rare) class to create a balance. This is useful when there is too little data to discard any of it.

For example, we take the rare class with 3K data points and replicate its samples until it is about 2.33 times its original size, that is, 7K data points. This brings the rare and abundant classes into balance, and the total data size becomes 14K.
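
A minimal sketch of this replication step, assuming the extra samples are drawn at random (with replacement) from the existing rare-class samples. The function name and placeholder samples are hypothetical.

```python
import random


def random_oversample(minority, target_size, seed=0):
    """Grow the minority class to target_size by re-drawing existing samples."""
    if target_size < len(minority):
        raise ValueError("target_size must be at least the minority size")
    rng = random.Random(seed)
    # Draw the missing samples with replacement from the minority class.
    extra = rng.choices(minority, k=target_size - len(minority))
    return list(minority) + extra


# Placeholder data: 3K rare-class and 7K abundant-class samples.
minority = [("rare", i) for i in range(3_000)]
majority = [("abundant", i) for i in range(7_000)]

oversampled = random_oversample(minority, target_size=len(majority))
print(len(oversampled), len(oversampled) + len(majority))  # 7000 14000
print(round(len(oversampled) / len(minority), 2))  # 2.33
```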

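We can use SMOTE (Synthetic Minority Over-sampling Technique) in Python to perform oversampling for imbalanced datasets. Instead of replicating existing samples, SMOTE creates new synthetic minority samples by interpolating between a minority sample and its nearest minority-class neighbours.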
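
The core interpolation idea behind SMOTE can be sketched in pure Python as shown below. This is a simplified illustration, not a full implementation: the function name, the toy points, and the choice of k are assumptions, and in practice a library implementation (for example the SMOTE class in the imbalanced-learn package) would normally be used.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the chosen point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with four two-dimensional points.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6, k=2)
print(len(new_points))  # 6
```

Each synthetic point lies on the line segment between two existing minority points, so the new samples stay inside the region the minority class already occupies.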

K-fold Cross-Validation

It's a resampling technique in which the entire dataset is divided into groups (folds), controlled by a single parameter referred to as 'k'; hence the name k-fold cross-validation. When a specific value for k is chosen, it is used in place of k in the name: for example, k=7 gives 7-fold cross-validation. Under this technique, the dataset is shuffled and split into k groups. Each group is used once as the test set while the model is trained on the remaining k-1 groups, so the evaluation does not depend on a single split and yields fairer results.
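
The shuffle-and-split procedure can be sketched as a small generator of index lists. The function name is hypothetical and the dataset size of 70 is chosen only so that the 7 folds come out even. This plain version does not look at class labels, so with a very small minority class some folds may contain few or no minority samples; stratified splitting is a common variant for that case.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold f takes every k-th shuffled index, starting at position f.
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, test


# 7-fold cross-validation over 70 samples: 60 for training, 10 for testing per round.
splits = list(k_fold_indices(n_samples=70, k=7))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 7 60 10
```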

Ensemble learning

Generally, a model generalizes better when the rare and abundant classes are compared at a 1:1 ratio. Under ensemble learning (also described as ensembling different resampled datasets), the data points in the majority class are divided into multiple subsets, and each subset is paired with the minority class. The system then trains and analyses on each of these balanced sets.

For example, the majority class has 10k data points, and the rare class has 2k data points. The majority class is divided into 5 different subsets of 2k data points each. Each of these subsets is then combined with the 2k data points from the rare class to generate a new sample. So, instead of one dataset with a ratio of 5:1 (abundant:rare), 5 sample sets with a ratio of 1:1 are created.
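
The 10k/2k example can be sketched as follows. The function name and placeholder samples are hypothetical, and the sketch assumes the majority class is shuffled before it is cut into chunks. It only builds the balanced subsets; typically one model would then be trained on each subset and their predictions combined, for instance by majority vote.

```python
import random


def balanced_subsets(majority, minority, seed=0):
    """Cut the majority class into minority-sized chunks and pair each
    chunk with the full minority class."""
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)
    size = len(minority)
    n_chunks = len(shuffled) // size  # leftover majority samples are dropped
    return [
        shuffled[c * size:(c + 1) * size] + list(minority)
        for c in range(n_chunks)
    ]


# Placeholder data: 10k abundant-class and 2k rare-class samples.
majority = [("abundant", i) for i in range(10_000)]
minority = [("rare", i) for i in range(2_000)]

subsets = balanced_subsets(majority, minority)
print(len(subsets), len(subsets[0]))  # 5 4000
```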