Random Oversampling and Undersampling for Imbalanced Classification

Author: Jason Brownlee

Imbalanced datasets are those where there is a severe skew in the class distribution, such as 1:100 or 1:1000 examples in the minority class to the majority class.

This bias in the training dataset can influence many machine learning algorithms, leading some to ignore the minority class entirely. This is a problem as it is typically the minority class on which predictions are most important.

One approach to addressing the problem of class imbalance is to randomly resample the training dataset. The two main approaches to randomly resampling an imbalanced dataset are to delete examples from the majority class, called undersampling, and to duplicate examples from the minority class, called oversampling.

In this tutorial, you will discover random oversampling and undersampling for imbalanced classification

After completing this tutorial, you will know:

  • Random resampling provides a naive technique for rebalancing the class distribution for an imbalanced dataset.
  • Random oversampling duplicates examples from the minority class in the training dataset and can result in overfitting for some models.
  • Random undersampling deletes examples from the majority class and can result in losing information invaluable to a model.

Discover SMOTE, one-class classification, cost-sensitive learning, threshold moving, and much more in my new book, with 30 step-by-step tutorials and full Python source code.

Let’s get started.

Random Oversampling and Undersampling for Imbalanced Classification

Random Oversampling and Undersampling for Imbalanced Classification
Photo by RichardBH, some rights reserved.

Tutorial Overview

This tutorial is divided into five parts; they are:

  1. Random Resampling Imbalanced Datasets
  2. Imbalanced-Learn Library
  3. Random Oversampling Imbalanced Datasets
  4. Random Undersampling Imbalanced Datasets
  5. Combining Random Oversampling and Undersampling

Random Resampling Imbalanced Datasets

Resampling involves creating a new transformed version of the training dataset in which the selected examples have a different class distribution.

This is a simple and effective strategy for imbalanced classification problems.

Applying re-sampling strategies to obtain a more balanced data distribution is an effective solution to the imbalance problem

A Survey of Predictive Modelling under Imbalanced Distributions, 2015.

The simplest strategy is to choose examples for the transformed dataset randomly, called random resampling.

There are two main approaches to random resampling for imbalanced classification; they are oversampling and undersampling.

  • Random Oversampling: Randomly duplicate examples in the minority class.
  • Random Undersampling: Randomly delete examples in the majority class.

Random oversampling involves randomly selecting examples from the minority class, with replacement, and adding them to the training dataset. Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset.

In the random under-sampling, the majority class instances are discarded at random until a more balanced distribution is reached.

— Page 45, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013

Both approaches can be repeated until the desired class distribution is achieved in the training dataset, such as an equal split across the classes.

They are referred to as “naive resampling” methods because they assume nothing about the data and no heuristics are used. This makes them simple to implement and fast to execute, which is desirable for very large and complex datasets.

Both techniques can be used for two-class (binary) classification problems and multi-class classification problems with one or more majority or minority classes.

Importantly, the change to the class distribution is only applied to the training dataset. The intent is to influence the fit of the models. The resampling is not applied to the test or holdout dataset used to evaluate the performance of a model.

Generally, these naive methods can be effective, although that depends on the specifics of the dataset and models involved.

Let’s take a closer look at each method and how to use them in practice.

Imbalanced-Learn Library

In these examples, we will use the implementations provided by the imbalanced-learn Python library, which can be installed via pip as follows:

sudo pip install imbalanced-learn

You can confirm that the installation was successful by printing the version of the installed library:

# check version number
import imblearn
print(imblearn.__version__)

Running the example will print the version number of the installed library; for example:

0.5.0

Want to Get Started With Imbalance Classification?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-Course

Random Oversampling Imbalanced Datasets

Random oversampling involves randomly duplicating examples from the minority class and adding them to the training dataset.

Examples from the training dataset are selected randomly with replacement. This means that examples from the minority class can be chosen and added to the new “more balanced” training dataset multiple times; they are selected from the original training dataset, added to the new training dataset, and then returned or “replaced” in the original dataset, allowing them to be selected again.

This technique can be effective for those machine learning algorithms that are affected by a skewed distribution and where multiple duplicate examples for a given class can influence the fit of the model. This might include algorithms that iteratively learn coefficients, like artificial neural networks that use stochastic gradient descent. It can also affect models that seek good splits of the data, such as support vector machines and decision trees.

It might be useful to tune the target class distribution. In some cases, seeking a balanced distribution for a severely imbalanced dataset can cause affected algorithms to overfit the minority class, leading to increased generalization error. The effect can be better performance on the training dataset, but worse performance on the holdout or test dataset.

… the random oversampling may increase the likelihood of occurring overfitting, since it makes exact copies of the minority class examples. In this way, a symbolic classifier, for instance, might construct rules that are apparently accurate, but actually cover one replicated example.

— Page 83, Learning from Imbalanced Data Sets, 2018.

As such, to gain insight into the impact of the method, it is a good idea to monitor the performance on both train and test datasets after oversampling and compare the results to the same algorithm on the original dataset.

The increase in the number of examples for the minority class, especially if the class skew was severe, can also result in a marked increase in the computational cost when fitting the model, especially considering the model is seeing the same examples in the training dataset again and again.

… in random over-sampling, a random set of copies of minority class examples is added to the data. This may increase the likelihood of overfitting, specially for higher over-sampling rates. Moreover, it may decrease the classifier performance and increase the computational effort.

A Survey of Predictive Modelling under Imbalanced Distributions, 2015.

Random oversampling can be implemented using the RandomOverSampler class.

The class can be defined and takes a sampling_strategy argument that can be set to “minority” to automatically balance the minority class with majority class or classes.

For example:

...
# define oversampling strategy
oversample = RandomOverSampler(sampling_strategy='minority')

This means that if the majority class had 1,000 examples and the minority class had 100, this strategy would oversampling the minority class so that it has 1,000 examples.

A floating point value can be specified to indicate the ratio of minority class majority examples in the transformed dataset. For example:

...
# define oversampling strategy
oversample = RandomOverSampler(sampling_strategy=0.5)

This would ensure that the minority class was oversampled to have half the number of examples as the majority class, for binary classification problems. This means that if the majority class had 1,000 examples and the minority class had 100, the transformed dataset would have 500 examples of the minority class.

The class is like a scikit-learn transform object in that it is fit on a dataset, then used to generate a new or transformed dataset. Unlike the scikit-learn transforms, it will change the number of examples in the dataset, not just the values (like a scaler) or number of features (like a projection).

For example, it can be fit and applied in one step by calling the fit_sample() function:

...
# fit and apply the transform
X_over, y_over = oversample.fit_resample(X, y)

We can demonstrate this on a simple synthetic binary classification problem with a 1:100 class imbalance.

...
# define dataset
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)

The complete example of defining the dataset and performing random oversampling to balance the class distribution is listed below.

# example of random oversampling to balance the class distribution
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
# define dataset
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
# summarize class distribution
print(Counter(y))
# define oversampling strategy
oversample = RandomOverSampler(sampling_strategy='minority')
# fit and apply the transform
X_over, y_over = oversample.fit_resample(X, y)
# summarize class distribution
print(Counter(y_over))

Running the example first creates the dataset, then summarizes the class distribution. We can see that there are nearly 10K examples in the majority class and 100 examples in the minority class.

Then the random oversample transform is defined to balance the minority class, then fit and applied to the dataset. The class distribution for the transformed dataset is reported showing that now the minority class has the same number of examples as the majority class.

Counter({0: 9900, 1: 100})
Counter({0: 9900, 1: 9900})

This transform can be used as part of a Pipeline to ensure that it is only applied to the training dataset as part of each split in a k-fold cross validation.

A traditional scikit-learn Pipeline cannot be used; instead, a Pipeline from the imbalanced-learn library can be used. For example:

...
# pipeline
steps = [('over', RandomOverSampler()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)

The example below provides a complete example of evaluating a decision tree on an imbalanced dataset with a 1:100 class distribution.

The model is evaluated using repeated 10-fold cross-validation with three repeats, and the oversampling is performed on the training dataset within each fold separately, ensuring that there is no data leakage as might occur if the oversampling was performed prior to the cross-validation.

# example of evaluating a decision tree with random oversampling
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
# define dataset
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
# define pipeline
steps = [('over', RandomOverSampler()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='f1_micro', cv=cv, n_jobs=-1)
score = mean(scores)
print('F1 Score: %.3f' % score)

Running the example evaluates the decision tree model on the imbalanced dataset with oversampling.

The chosen model and resampling configuration are arbitrary, designed to provide a template that you can use to test undersampling with your dataset and learning algorithm, rather than optimally solve the synthetic dataset.

The default oversampling strategy is used, which balances the minority classes with the majority class. The F1 score averaged across each fold and each repeat is reported.

Your specific results may differ given the stochastic nature of the dataset and the resampling strategy.

F1 Score: 0.990

Now that we are familiar with oversampling, let’s take a look at undersampling.

Random Undersampling Imbalanced Datasets

Random undersampling involves randomly selecting examples from the majority class to delete from the training dataset.

This has the effect of reducing the number of examples in the majority class in the transformed version of the training dataset. This process can be repeated until the desired class distribution is achieved, such as an equal number of examples for each class.

This approach may be more suitable for those datasets where there is a class imbalance although a sufficient number of examples in the minority class, such a useful model can be fit.

A limitation of undersampling is that examples from the majority class are deleted that may be useful, important, or perhaps critical to fitting a robust decision boundary. Given that examples are deleted randomly, there is no way to detect or preserve “good” or more information-rich examples from the majority class.

… in random under-sampling (potentially), vast quantities of data are discarded. […] This can be highly problematic, as the loss of such data can make the decision boundary between minority and majority instances harder to learn, resulting in a loss in classification performance.

— Page 45, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013

The random undersampling technique can be implemented using the RandomUnderSampler imbalanced-learn class.

The class can be used just like the RandomOverSampler class in the previous section, except the strategies impact the majority class instead of the minority class. For example, setting the sampling_strategy argument to “majority” will undersample the majority class determined by the class with the largest number of examples.

...
# define undersample strategy
undersample = RandomUnderSampler(sampling_strategy='majority')

For example, a dataset with 1,000 examples in the majority class and 100 examples in the minority class will be undersampled such that both classes would have 100 examples in the transformed training dataset.

We can also set the sampling_strategy argument to a floating point value which will be a percentage relative to the minority class, specifically the number of examples in the minority class divided by the number of examples in the majority class. For example, if we set sampling_strategy to 0.5 in an imbalanced data dataset with 1,000 examples in the majority class and 100 examples in the minority class, then there would be 200 examples for the majority class in the transformed dataset (or 100/200 = 0.5).

...
# define undersample strategy
undersample = RandomUnderSampler(sampling_strategy=0.5)

This might be preferred to ensure that the resulting dataset is both large enough to fit a reasonable model, and that not too much useful information from the majority class is discarded.

In random under-sampling, one might attempt to create a balanced class distribution by selecting 90 majority class instances at random to be removed. The resulting dataset will then consist of 20 instances: 10 (randomly remaining) majority class instances and (the original) 10 minority class instances.

— Page 45, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013

The transform can then be fit and applied to a dataset in one step by calling the fit_resample() function and passing the untransformed dataset as arguments.

...
# fit and apply the transform
X_over, y_over = undersample.fit_resample(X, y)

We can demonstrate this on a dataset with a 1:100 class imbalance.

The complete example is listed below.

# example of random undersampling to balance the class distribution
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
# define dataset
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
# summarize class distribution
print(Counter(y))
# define undersample strategy
undersample = RandomUnderSampler(sampling_strategy='majority')
# fit and apply the transform
X_over, y_over = undersample.fit_resample(X, y)
# summarize class distribution
print(Counter(y_over))

Running the example first creates the dataset and reports the imbalanced class distribution.

The transform is fit and applied on the dataset and the new class distribution is reported. We can see that that majority class is undersampled to have the same number of examples as the minority class.

Judgment and empirical results will have to be used as to whether a training dataset with just 200 examples would be sufficient to train a model.

Counter({0: 9900, 1: 100})
Counter({0: 100, 1: 100})

This undersampling transform can also be used in a Pipeline, like the oversampling transform from the previous section.

This allows the transform to be applied to the training dataset only using evaluation schemes such as k-fold cross-validation, avoiding any data leakage in the evaluation of a model.

...
# define pipeline
steps = [('under', RandomUnderSampler()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)

We can define an example of fitting a decision tree on an imbalanced classification dataset with the undersampling transform applied to the training dataset on each split of a repeated 10-fold cross-validation.

The complete example is listed below.

# example of evaluating a decision tree with random undersampling
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
# define dataset
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
# define pipeline
steps = [('under', RandomUnderSampler()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='f1_micro', cv=cv, n_jobs=-1)
score = mean(scores)
print('F1 Score: %.3f' % score)

Running the example evaluates the decision tree model on the imbalanced dataset with undersampling.

The chosen model and resampling configuration are arbitrary, designed to provide a template that you can use to test undersampling with your dataset and learning algorithm rather than optimally solve the synthetic dataset.

The default undersampling strategy is used, which balances the majority classes with the minority class. The F1 score averaged across each fold and each repeat is reported.

Your specific results may differ given the stochastic nature of the dataset and the resampling strategy.

F1 Score: 0.889

Combining Random Oversampling and Undersampling

Interesting results may be achieved by combining both random oversampling and undersampling.

For example, a modest amount of oversampling can be applied to the minority class to improve the bias towards these examples, whilst also applying a modest amount of undersampling to the majority class to reduce the bias on that class.

This can result in improved overall performance compared to performing one or the other techniques in isolation.

For example, if we had a dataset with a 1:100 class distribution, we might first apply oversampling to increase the ratio to 1:10 by duplicating examples from the minority class, then apply undersampling to further improve the ratio to 1:2 by deleting examples from the majority class.

This could be implemented using imbalanced-learn by using a RandomOverSampler with sampling_strategy set to 0.1 (10%), then using a RandomUnderSampler with a sampling_strategy set to 0.5 (50%). For example:

...
# define oversampling strategy
over = RandomOverSampler(sampling_strategy=0.1)
# fit and apply the transform
X, y = over.fit_resample(X, y)
# define undersampling strategy
under = RandomUnderSampler(sampling_strategy=0.5)
# fit and apply the transform
X, y = under.fit_resample(X, y)

We can demonstrate this on a synthetic dataset with a 1:100 class distribution. The complete example is listed below:

# example of combining random oversampling and undersampling for imbalanced data
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
# define dataset
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
# summarize class distribution
print(Counter(y))
# define oversampling strategy
over = RandomOverSampler(sampling_strategy=0.1)
# fit and apply the transform
X, y = over.fit_resample(X, y)
# summarize class distribution
print(Counter(y))
# define undersampling strategy
under = RandomUnderSampler(sampling_strategy=0.5)
# fit and apply the transform
X, y = under.fit_resample(X, y)
# summarize class distribution
print(Counter(y))

Running the example first creates the synthetic dataset and summarizes the class distribution, showing an approximate 1:100 class distribution.

Then oversampling is applied, increasing the distribution from about 1:100 to about 1:10. Finally, undersampling is applied, further improving the class distribution from 1:10 to about 1:2

Counter({0: 9900, 1: 100})
Counter({0: 9900, 1: 990})
Counter({0: 1980, 1: 990})

We might also want to apply this same hybrid approach when evaluating a model using k-fold cross-validation.

This can be achieved by using a Pipeline with a sequence of transforms and ending with the model that is being evaluated; for example:

...
# define pipeline
over = RandomOverSampler(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('o', over), ('u', under), ('m', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)

We can demonstrate this with a decision tree model on the same synthetic dataset.

The complete example is listed below.

# example of evaluating a model with random oversampling and undersampling
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
# define dataset
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
# define pipeline
over = RandomOverSampler(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('o', over), ('u', under), ('m', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='f1_micro', cv=cv, n_jobs=-1)
score = mean(scores)
print('F1 Score: %.3f' % score)

Running the example evaluates a decision tree model using repeated k-fold cross-validation where the training dataset is transformed, first using oversampling, then undersampling, for each split and repeat performed. The F1 score averaged across each fold and each repeat is reported.

The chosen model and resampling configuration are arbitrary, designed to provide a template that you can use to test undersampling with your dataset and learning algorithm rather than optimally solve the synthetic dataset.

Your specific results may differ given the stochastic nature of the dataset and the resampling strategy.

F1 Score: 0.985

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

Papers

API

Articles

Summary

In this tutorial, you discovered random oversampling and undersampling for imbalanced classification

Specifically, you learned:

  • Random resampling provides a naive technique for rebalancing the class distribution for an imbalanced dataset.
  • Random oversampling duplicates examples from the minority class in the training dataset and can result in overfitting for some models.
  • Random undersampling deletes examples from the majority class and can result in losing information invaluable to a model.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Random Oversampling and Undersampling for Imbalanced Classification appeared first on Machine Learning Mastery.

Go to Source