How many times have you built a 99% accurate model that turned out to be unusable? Building classification models is no joke, especially when there’s a class imbalance in your data. You know, when there’s only one fraud in 1,000 transactions.
You see, identifying 950 of 999 genuine transactions is easy. The trick is to correctly identify a single fraud case – every time.
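To put numbers on that, here's a small arithmetic sketch. It assumes a hypothetical "model" that labels every transaction as genuine, scored on 1,000 made-up transactions with a single fraud case. The variable names are illustrative only.

```python
# Hypothetical data: 999 genuine transactions (0) and one fraud (1)
y_true = [0] * 999 + [1]
# A "model" that always predicts genuine
y_pred = [0] * 1000

# Accuracy = correct predictions / all predictions
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall = frauds caught / all actual frauds
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print(f"Accuracy: {accuracy:.3f}")
print(f"Recall:   {recall:.3f}")
```

The accuracy looks great while the recall is zero, because the one case that matters is never caught.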
That’s where SMOTE (Synthetic Minority Over-sampling Technique) comes in handy. You can use it to oversample the minority class. SMOTE is a type of data augmentation that synthesizes new samples from the existing ones.
Yes – SMOTE actually creates new samples. It is light years ahead of simple duplication of the minority class. That approach creates “new” data points by copying existing ones. As a result, no new information is brought to the dataset.
But how does SMOTE do it?
It selects samples in the minority class that are close and then draws lines between them. New sample points are located on these lines.
To be more precise, a random minority sample is chosen, and then a k-nearest neighbors (KNN) algorithm is used to select the neighbors to which lines are drawn. With this procedure, you can create as many synthetic samples as needed, so SMOTE works on datasets of almost any size.
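The procedure can be sketched in plain Python. This is a simplified illustration of the interpolation idea, not the imbalanced-learn implementation. The function name `smote_sketch`, its parameters, and the sample points are all made up for this example.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=42):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # 1. Pick a random minority sample
        base = rng.choice(minority)
        # 2. Find its k nearest minority neighbors (excluding the sample itself)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        # 3. Pick one neighbor and a random position on the connecting line
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # 0 = at the base point, close to 1 = near the neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Made-up minority class with two features
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8)]
new_points = smote_sketch(minority, n_new=10)
print(len(new_points))
```

Every synthetic point sits on a line segment between two real minority samples, which is why it adds information that plain duplication can't.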
The only real downside is that synthetic examples are created without “consulting” the majority class. This could result in overlapping samples in both classes.
That’s the only theory you’ll need to understand this article.
Here are the topics covered next:
- Dataset loading and preparation
- Machine learning without SMOTE
- Machine learning with SMOTE
Dataset loading and preparation
So you need a classification dataset that suffers from a class imbalance problem. Something like credit card fraud detection should do. Here’s one from Kaggle you can download for free.
Here’s how to load it with Python:
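A minimal loading sketch with pandas follows. Since the file name isn't given here, the example reads a tiny in-memory stand-in instead. The rows, the `house_type` values, and the `target` column name are assumptions made for illustration.

```python
import io

import pandas as pd

# Stand-in for the downloaded file; the rows and the "target" name are made up
csv_text = """ID,income,no_of_child,house_type,target
1,50000,0,House / apartment,0
2,72000,2,With parents,0
3,31000,1,House / apartment,1
"""

# With the real file, pass its path to pd.read_csv() instead of the StringIO object
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
print(df.head())
```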
There are twenty-something columns which you’ll prepare in a bit. First, let’s explore the target class distribution:
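One way to check the distribution is with `value_counts()`. The sketch below runs on a made-up frame with a hypothetical `target` column, so the numbers won't match the real dataset.

```python
import pandas as pd

# Made-up target column: 3 fraud cases in 200 rows
df = pd.DataFrame({"target": [1] * 3 + [0] * 197})

# Absolute counts per class
counts = df["target"].value_counts()
# Share of each class in percent
shares = df["target"].value_counts(normalize=True) * 100

print(counts)
print(shares.round(2))
```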
Yikes. Only around 1.68% of transactions are classified as fraud. A great recipe to make high-accuracy low-recall models. More on that in a bit.
You can’t pass a dataset to a machine learning algorithm in this form. Some preparation is a must.
You won’t spend much time here. The goal is to get a minimum viable dataset for machine learning.
Here’s the list of initial changes:
- Convert two-valued columns such as reality to integers (0, 1) – these columns have only two possible values
- Create dummy variables for string columns such as house_type – to go from strings to binary (0, 1)
- Drop unnecessary columns – ID, and every column for which you created dummy variables
- Merge all into a single data frame
Here’s the code for that:
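The steps listed above could look roughly like this. It's a sketch on a made-up four-row frame. The "Y"/"N" encoding of `reality`, the `house_type` values, and the `target` column are assumptions, and the real dataset has more columns to handle.

```python
import pandas as pd

# Made-up stand-in for the raw dataset
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "reality": ["Y", "N", "Y", "Y"],
    "house_type": ["House / apartment", "With parents",
                   "Rented apartment", "House / apartment"],
    "income": [50000, 72000, 31000, 45000],
    "target": [0, 0, 1, 0],
})

# 1. Two-valued column to integers (the "Y"/"N" encoding is an assumption)
df["reality"] = (df["reality"] == "Y").astype(int)

# 2. Dummy variables for the string column
dummies = pd.get_dummies(df["house_type"], prefix="house_type", dtype=int)

# 3. Drop ID and the column that was dummy-encoded
df = df.drop(columns=["ID", "house_type"])

# 4. Merge everything into a single data frame
df = pd.concat([df, dummies], axis=1)
print(df.head())
```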
The dataset now looks like this:
Better, but still needs a bit of work. Notice how much larger the values are in income than in no_of_child. That’s expected, but many machine learning algorithms will give more importance to variables on a larger scale. Introducing data scaling.
You can use scikit-learn to scale columns that have values greater than 1 to the [0, 1] range. Here’s how:
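One way to do this is with scikit-learn's `MinMaxScaler`. Treat the choice of scaler as an assumption, since the exact class isn't named above. The frame below is made up.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up frame: two wide-range columns and one binary column
df = pd.DataFrame({
    "income": [50000.0, 72000.0, 31000.0, 45000.0],
    "no_of_child": [0.0, 2.0, 1.0, 0.0],
    "reality": [1, 0, 1, 1],
})

# Only scale columns whose maximum exceeds 1; binary columns stay untouched
to_scale = [col for col in df.columns if df[col].max() > 1]

scaler = MinMaxScaler()  # default feature_range is (0, 1)
df[to_scale] = scaler.fit_transform(df[to_scale])
print(df)
```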
Here’s how the dataset looks now:
Much better – everything is in the [0, 1] range, all columns are numerical, and there are no missing values.
This means one thing – the dataset is machine learning ready.
Machine learning without SMOTE
Let’s start with a naive approach. You’ll create a Random Forest model on the dataset and completely ignore the class imbalance.
To start, you’ll have to split the dataset into training and testing portions. Only 1.68% of transactions in the entire dataset are fraudulent. Ideally, you want the percentage roughly the same in the train and test sets.
Here’s how to do the split and check the percentage of the positive class:
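A sketch of such a split, using `train_test_split` with the `stratify` argument to keep the class ratio stable. The data is randomly generated, and the 25% test size and the seed are arbitrary choices for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1,000 rows, 20 of them positive (2%)
rng = np.random.default_rng(42)
X = rng.random((1000, 5))
y = np.array([1] * 20 + [0] * 980)

# stratify=y keeps the class ratio roughly the same in both portions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(f"Positive share in train: {y_train.mean() * 100:.2f}%")
print(f"Positive share in test:  {y_test.mean() * 100:.2f}%")
```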
Onto the modeling now. Let’s make it as simple as possible. You’ll train a Random Forest classifier on the train set and evaluate it on the test set. Confusion matrix, accuracy score, and recall score will tell you just how bad it is:
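Here's a sketch of that workflow. It runs on a synthetic imbalanced dataset from `make_classification` rather than the fraud data, so the scores will differ from the ones discussed in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data: roughly 2% positives
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.98], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Default Random Forest, class imbalance ignored on purpose
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_test)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, preds)
accuracy = accuracy_score(y_test, preds)
recall = recall_score(y_test, preds)

print(cm)
print(f"Accuracy: {accuracy:.4f}")
print(f"Recall:   {recall:.4f}")
```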
The model is 98% accurate, so where’s the problem?
Yes, it can correctly classify almost all genuine transactions. But it also classified 91% of fraud transactions as genuine. In a nutshell – the model is unusable.
Class imbalance killed its performance. SMOTE can help.
Machine learning with SMOTE
You already know what SMOTE is, and now you’ll see how to install it and use it. Execute the following command from Terminal:
pip install imbalanced-learn
You can now apply SMOTE to features (X) and the target (y) and store the results in dedicated variables. The new feature and target set is larger, due to oversampling. Here’s the code for applying SMOTE:
There are 37K data points instead of 25K, and the class balance is perfect – 50:50. You’ll train the model on a new dataset next:
The resulting model is usable, to say the least. SMOTE did its job, and it resulted in a model that significantly outperformed its previous version.
Let’s wrap things up next.
And there you have it – SMOTE in a nutshell. You can use it whenever a dataset suffers from a class imbalance problem. A popular approach nowadays is to combine undersampling and oversampling, but that’s a topic for another time.
Just for fun, you can compare the misclassifications of both models. By doing so, you could see if the model built after oversampling still misclassifies the same data points.
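Here's one way to sketch that comparison with plain Python sets. It assumes both models were scored on the same test rows. The labels, predictions, and the `misclassified` helper are made up for illustration.

```python
# Hypothetical test labels and predictions from two models on the same rows
y_test = [0, 0, 1, 0, 1, 1, 0, 1]
preds_before = [0, 0, 0, 0, 0, 1, 0, 0]  # model trained without SMOTE
preds_after = [0, 1, 1, 0, 0, 1, 0, 1]   # model trained with SMOTE


def misclassified(y_true, y_pred):
    """Return the set of row positions where the prediction is wrong."""
    return {i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p}


wrong_before = misclassified(y_test, preds_before)
wrong_after = misclassified(y_test, preds_after)

# Set operations show which errors survived, disappeared, or appeared
print("Still wrong after SMOTE:", sorted(wrong_before & wrong_after))
print("Fixed by SMOTE:", sorted(wrong_before - wrong_after))
print("Newly wrong after SMOTE:", sorted(wrong_after - wrong_before))
```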
How do you handle class imbalance? Let me know in the comments below.