How to Effortlessly Handle Class Imbalance with Python and SMOTE

How many times did you get a 99% accurate model that’s unusable? Building classification models is no joke, especially when there’s a class imbalance in your data. You know, when there’s only one fraud in 1000 transactions.

You see, identifying 950 of 999 genuine transactions is easy. The trick is to correctly identify a single fraud case – every time. 

That’s where SMOTE (Synthetic Minority Over-sampling Technique) comes in handy. You can use it to oversample the minority class. SMOTE is a type of data augmentation that synthesizes new samples from the existing ones.

Yes – SMOTE actually creates new samples. It is light years ahead of simple duplication of the minority class. That approach stupidly creates “new” data points by copying existing ones. As a result, no new information is brought to the dataset.

But how does SMOTE do it? 

It selects samples in the minority class that are close and then draws lines between them. New sample points are located on these lines.

To be more precise, a random sample from the minority class is chosen, and then a k-nearest neighbors (KNN) search selects the neighbors to which lines are drawn. With this procedure, you can create as many synthetic samples as needed, which makes SMOTE practical for datasets of all sizes.
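The line-drawing idea fits in a few lines of NumPy. This is a minimal sketch, not the imbalanced-learn implementation: the function name `smote_sketch` and its parameters are made up for illustration, and it assumes the minority samples are purely numeric with at least two rows.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points from the minority samples X_min."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other points

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest per sample

    base = rng.integers(0, n, size=n_new)  # randomly chosen samples
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one neighbor each
    gap = rng.random((n_new, 1))  # position along the connecting line

    # Each new point lies on the segment between a sample and its neighbor.
    return X_min[base] + gap * (X_min[picked] - X_min[base])

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(minority, n_new=10, k=2)
print(synthetic.shape)
```

Every synthetic point is a blend of two real minority points, which is why no point lands outside the region the minority class already occupies.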

The only real downside is that synthetic examples are created without “consulting” the majority class. This could result in overlapping samples in both classes.

That’s the only theory you’ll need to understand this article. 

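Here are the topics covered next:

  • Dataset loading and preparation
  • Machine learning without SMOTE
  • Machine learning with SMOTE
  • Conclusion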

Dataset loading and preparation

So you need a classification dataset that suffers from a class imbalance problem. Something like credit card fraud detection should do. Here’s one from Kaggle you can download for free. 

Here’s how to load it with Python:
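The original listing is not reproduced here. A minimal sketch of the loading step, assuming the Kaggle download was saved as a CSV file; the helper `load_transactions` and the file name `credit_card.csv` are hypothetical:

```python
import pandas as pd

def load_transactions(csv_path):
    """Read the downloaded CSV file into a DataFrame."""
    return pd.read_csv(csv_path)

# Hypothetical file name; point it at wherever you saved the download.
# df = load_transactions("credit_card.csv")
# df.head()
```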

Image 1 – Head of credit card fraud dataset (image by author)

There are twenty-something columns which you’ll prepare in a bit. First, let’s explore the target class distribution:
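One way to check the distribution is `value_counts`, both as raw counts and as percentages. The sketch below assumes the label column is called `target` (an assumption, adjust it to your file) and runs on a toy frame instead of the real data:

```python
import pandas as pd

def class_distribution(df, target_col="target"):
    """Return counts and percentage shares of each class."""
    counts = df[target_col].value_counts()
    shares = df[target_col].value_counts(normalize=True) * 100
    return pd.DataFrame({"count": counts, "percent": shares.round(2)})

# Toy stand-in: 98 genuine rows and 2 fraud rows.
toy = pd.DataFrame({"target": [0] * 98 + [1] * 2})
dist = class_distribution(toy)
print(dist)
```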

Image 2 – Target variable class distribution (image by author)

Yikes. Only around 1.68% of transactions are classified as fraud. That’s a great recipe for high-accuracy, low-recall models. More on that in a bit.

You can’t pass a dataset to a machine learning algorithm in this form. Some preparation is a must.

Data preparation

You won’t spend much time here. The goal is to get a minimum viable dataset for machine learning. 

Here’s the list of initial changes:

  • Remap gender, car, and reality to integers (0, 1) – these columns have only two possible values
  • Create dummy variables for income_type, education_type, family_name, house_type – to go from strings to binary (0, 1)
  • Drop unnecessary columns – Unnamed: 0, ID, and every column for which you created dummy variables
  • Merge all into a single data frame

Here’s the code for that: 
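The original listing is missing, so here is a rough sketch of the four steps. The column names come from the list above; the function name `prepare` is made up, and since the actual labels inside the two-valued columns are not shown, `pd.factorize` is used to map whatever two values appear to 0 and 1:

```python
import pandas as pd

BINARY_COLS = ["gender", "car", "reality"]
DUMMY_COLS = ["income_type", "education_type", "family_name", "house_type"]
DROP_COLS = ["Unnamed: 0", "ID"]

def prepare(df):
    """Remap binary columns, one-hot encode the rest, drop leftovers."""
    out = df.copy()

    # Two-valued columns become 0/1 (the alphabetically first value maps to 0).
    for col in BINARY_COLS:
        codes, _ = pd.factorize(out[col], sort=True)
        out[col] = codes

    # One 0/1 column per category, named like income_type_<value>.
    dummies = pd.get_dummies(out[DUMMY_COLS], dtype=int)

    # Remove helper columns and the original string columns.
    out = out.drop(columns=DROP_COLS + DUMMY_COLS, errors="ignore")

    # Merge everything into a single data frame.
    return pd.concat([out, dummies], axis=1)
```

Calling `prepare(df)` on the loaded data frame returns a fully numeric copy and leaves the original untouched.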

The dataset now looks like this:

Image 3 – Dataset after initial preparation (image by author)

Better, but still needs a bit of work. Notice how much larger the values in income are than those in no_of_child. That’s expected, but many machine learning algorithms, distance-based ones in particular, give more weight to variables on a larger scale. Introducing data scaling.

You’ll use MinMaxScaler from scikit-learn to scale columns that have values greater than 1 to the [0, 1] range. Here’s how:
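A possible version of that step, assuming every column is already numeric. The helper name `scale_wide_columns` is made up for this sketch; it picks the columns whose maximum exceeds 1 and rescales only those:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def scale_wide_columns(df):
    """Scale every numeric column with values above 1 to the [0, 1] range."""
    out = df.copy()
    numeric = out.select_dtypes("number").columns
    to_scale = [col for col in numeric if out[col].max() > 1]
    if to_scale:
        scaled = pd.DataFrame(
            MinMaxScaler().fit_transform(out[to_scale]),
            columns=to_scale,
            index=out.index,
        )
        out[to_scale] = scaled
    return out

# Toy stand-in: income and no_of_child get scaled, gender stays as is.
toy = pd.DataFrame(
    {"income": [100, 200, 300], "no_of_child": [0, 1, 2], "gender": [0, 1, 0]}
)
print(scale_wide_columns(toy))
```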

Here’s how the dataset looks now:

Image 4 – Dataset after scaling (image by author)

Much better – everything is in the [0, 1] range, all columns are numerical, and there are no missing values.

This means one thing – the dataset is machine learning ready.

Machine learning without SMOTE

Let’s start with a naive approach. You’ll create a Random Forest model on the dataset and completely ignore the class imbalance. 

To start, you’ll have to split the dataset into training and testing portions. Only 1.68% of the transactions in the entire dataset are fraudulent. Ideally, you want that percentage to be roughly the same in the train and test sets.

Here’s how to do the split and check the percentage of the positive class:
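A sketch of the split on a synthetic stand-in for the prepared dataset (generated with `make_classification`, roughly 2% positives). Passing `stratify=y` is one way to keep the class ratio the same in both portions; it is an addition here and not necessarily what the original listing did:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared dataset: roughly 2% positive class.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.98], flip_y=0, random_state=42
)

# stratify=y keeps the class ratio (almost) identical in both portions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(f"Positive class in train set: {y_train.mean() * 100:.2f}%")
print(f"Positive class in test set:  {y_test.mean() * 100:.2f}%")
```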

Image 5 – Percentage of positive class in train and test sets (image by author)

Onto the modeling now. Let’s make it as simple as possible. You’ll train a Random Forest classifier on the train set and evaluate it on the test set. The confusion matrix, accuracy score, and recall score will tell you just how bad it is:
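Roughly, the training and evaluation could look like the sketch below. It runs on synthetic imbalanced data instead of the fraud dataset, so the printed numbers will differ from the ones in the article:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 2% positives.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.98], flip_y=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Train with default settings and ignore the imbalance entirely.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_test)

acc = accuracy_score(y_test, preds)
rec = recall_score(y_test, preds)  # share of positives the model caught
cm = confusion_matrix(y_test, preds)  # rows: actual, columns: predicted

print(f"Accuracy = {acc:.2f}")
print(f"Recall = {rec:.2f}")
print(cm)
```

Recall is the number to watch: it only counts how many of the actual positives were found, so a model that labels everything as genuine scores 0 no matter how high the accuracy is.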

Image 6 – Accuracy, recall, and confusion matrix of a model without using SMOTE (image by author)

The model is 98% accurate, so where’s the problem? 

Yes, it can correctly classify almost all genuine transactions. But it also classified 91% of fraud transactions as genuine. In a nutshell – the model is unusable.

Class imbalance killed its performance. SMOTE can help.

Machine learning with SMOTE

You already know what SMOTE is, and now you’ll see how to install it and use it. It ships with the imbalanced-learn library, so execute the following command from the terminal:

pip install imbalanced-learn

You can now apply SMOTE to features (X) and the target (y) and store the results in dedicated variables. The new feature and target set is larger, due to oversampling. Here’s the code for applying SMOTE:

Image 7 – Shapes and class balance after applying SMOTE (image by author)

There are 37K data points instead of 25K, and the class balance is perfect – 50:50. You’ll train the model on the resampled dataset next:

Image 8 – Accuracy, recall, and confusion matrix of a model after using SMOTE (image by author)

The resulting model is usable, to say the least. SMOTE did its job, and it resulted in a model that significantly outperformed its previous version.

Let’s wrap things up next.

Conclusion

And there you have it – SMOTE in a nutshell. You can use it whenever a dataset suffers from a class imbalance problem. The go-to approach nowadays is to use both undersampling and oversampling, but that’s a topic for another time.

Just for fun, you can compare the misclassifications of both models. By doing so, you could see if the model built after oversampling still misclassifies the same data points.
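If you want to try that comparison, here is a small sketch. It assumes both models are scored on the same test rows, and the function name and the toy predictions are made up for illustration:

```python
import numpy as np

def compare_misclassifications(y_true, preds_a, preds_b):
    """Group misclassified row positions by which model got them wrong."""
    y_true = np.asarray(y_true)
    wrong_a = set(np.flatnonzero(np.asarray(preds_a) != y_true).tolist())
    wrong_b = set(np.flatnonzero(np.asarray(preds_b) != y_true).tolist())
    return {
        "both": sorted(wrong_a & wrong_b),
        "only_first": sorted(wrong_a - wrong_b),
        "only_second": sorted(wrong_b - wrong_a),
    }

# Toy labels and predictions from two hypothetical models.
y_true = [0, 0, 1, 1, 0, 1]
preds_plain = [0, 0, 0, 0, 0, 1]  # misses the positives at positions 2 and 3
preds_smote = [0, 1, 1, 0, 0, 1]  # wrong at positions 1 and 3

result = compare_misclassifications(y_true, preds_plain, preds_smote)
print(result)
# {'both': [3], 'only_first': [2], 'only_second': [1]}
```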

How do you handle class imbalance? Let me know in the comments below.

Dario Radečić
Data scientist, blogger, and enthusiast. Passionate about deep learning, computer vision, and data-driven decision making.
