
How to Use Python and MissForest Algorithm to Impute Missing Data

Missing value imputation is an age-old problem in data science and machine learning. Techniques range from simple mean/median imputation to more sophisticated methods based on machine learning. How much of an impact does the choice of approach have on the final results? As it turns out, a lot.

Let’s get a couple of things straight – missing value imputation is domain-specific more often than not. For example, a dataset might contain missing values because a customer isn’t using some service, so imputation would be the wrong thing to do.

Further, simple techniques like mean/median/mode imputation often don’t work well. And it’s easy to see why. Extremes can influence average values in the dataset, the mean in particular. Also, filling 10% or more of the data with the same value doesn’t sound too peachy, at least for continuous variables.

The article is structured as follows:

  • Problems with KNN imputation
  • What is MissForest?
  • MissForest in practice
  • MissForest evaluation
  • Parting words

Problems with KNN imputation

Even some of the machine learning-based imputation techniques have issues. For example, KNN imputation is a great stepping stone from simple average imputation, but it poses a couple of problems:

  • You need to choose a value for K – not an issue for small datasets
  • It’s sensitive to outliers because it uses Euclidean distance under the hood
  • It can’t be applied to categorical data directly, as some form of conversion to a numerical representation is required
  • It can be computationally expensive, depending on the size of your dataset

Don’t get me wrong, I would pick KNN imputation over a simple average any day, but there are still better methods. If you want to find out more on the topic, here’s my recent article:

Missing Value Imputation with Python and K-Nearest Neighbors

So, what can you do? MissForest to the rescue.

What is MissForest?

MissForest is a machine learning-based imputation technique. It uses a Random Forest algorithm to do the job. It’s based on an iterative approach, and the predictions improve with each iteration. You can read more about the theory of the algorithm below, as Andre Ye provides great explanations and beautiful visuals:

Towards Data Science – MissForest: The Best Missing Data Imputation Algorithm?

This article aims more toward practical application, so we won’t dive too deep into the theory. To summarize, MissForest is excellent because it:

  • Doesn’t require extensive data preparation – a Random Forest algorithm can determine which features are important on its own
  • Doesn’t require any hyperparameter tuning – unlike K in K-Nearest Neighbors
  • Doesn’t care about categorical data types – Random Forest knows how to handle them

Next, we’ll dive deep into a practical example.

MissForest in practice

We’ll work with the Iris dataset for the practical part. The dataset doesn’t contain any missing values, but that’s the whole point. We will produce missing values randomly, so we can later evaluate the performance of the MissForest algorithm.

Before I forget, please install the required library by executing pip install missingpy from the Terminal.

Great! Next, let’s import NumPy and Pandas and read in the mentioned Iris dataset. We’ll also make a copy of the dataset so we can evaluate against the real values later on:
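The loading code was originally embedded as an image, so here’s a minimal sketch of what it did. The iris.csv path is a placeholder – point it at any copy of the Iris dataset that includes the species target column:

```python
import numpy as np
import pandas as pd

# 'iris.csv' is a placeholder path -- any copy of the Iris dataset works
iris = pd.read_csv('iris.csv')

# Keep an untouched copy for evaluating the imputation later on
iris_orig = iris.copy()

iris.head()
```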

Image: Iris dataset head

All right, let’s now make two lists of random numbers ranging from zero to the Iris dataset’s length. With some Pandas manipulation, we’ll replace the values of sepal_length and petal_width with NaNs at the randomly generated index positions:
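Here’s one way to do it – a sketch in which the sample size of 15 per column and the seed are my assumptions. Note that np.random.randint can return duplicates, which is how a column can end up with fewer than 15 missing values:

```python
np.random.seed(42)  # arbitrary seed, just for reproducibility

# Random index positions -- 15 per column, duplicates are possible
idx_sepal_length = np.random.randint(0, len(iris), 15)
idx_petal_width = np.random.randint(0, len(iris), 15)

# Replace the values at those positions with NaNs
iris.loc[idx_sepal_length, 'sepal_length'] = np.nan
iris.loc[idx_petal_width, 'petal_width'] = np.nan

iris.isnull().sum()
```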

Image: Missing values count

As you can see, petal_width contains only 14 missing values. That’s because the randomization process generated two identical index positions. It doesn’t pose a problem, as the exact number of missing values is arbitrary anyway.

The next step is to, well, perform the imputation. We’ll have to remove the target variable from the picture too. Here’s how:
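Here’s a minimal reconstruction of that step, assuming the species column name from this copy of the dataset and MissForest’s default settings:

```python
from missingpy import MissForest

# The target variable shouldn't take part in the imputation
X = iris.drop('species', axis=1)

# MissForest works with its defaults -- no mandatory tuning
imputer = MissForest()
X_imputed = imputer.fit_transform(X)  # returns a NumPy array
```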

And that’s it – missing values are now imputed!

But how do we evaluate the damn thing? That’s the question we’ll answer next.

MissForest evaluation

To perform the evaluation, we’ll make use of our copied, untouched dataset. We’ll add two additional columns representing the imputed columns from the MissForest algorithm – both for sepal_length and petal_width.

We’ll then create a new dataset containing only these two columns – in the original and imputed states. Finally, we will calculate the absolute errors for further inspection.

Here’s the code:
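Here’s a sketch of the evaluation, assuming the variable and column names from the previous snippets:

```python
# fit_transform returned a NumPy array, ordered like the columns of X
X_imputed = pd.DataFrame(X_imputed, columns=X.columns)

# Attach the imputed columns to the untouched copy
iris_orig['sepal_length_imputed'] = X_imputed['sepal_length']
iris_orig['petal_width_imputed'] = X_imputed['petal_width']

# Keep only the two columns in their original and imputed states
evaluation = iris_orig[['sepal_length', 'sepal_length_imputed',
                        'petal_width', 'petal_width_imputed']].copy()

# Absolute errors for further inspection
evaluation['sepal_length_error'] = (
    evaluation['sepal_length'] - evaluation['sepal_length_imputed']
).abs()
evaluation['petal_width_error'] = (
    evaluation['petal_width'] - evaluation['petal_width_imputed']
).abs()

# Select only the rows on which imputation was performed
evaluation[iris['sepal_length'].isna() | iris['petal_width'].isna()]
```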

As you can see, the last line of code selects only those rows on which imputation was performed. Let’s take a look:

Image: Original vs. Imputed data

All absolute errors are small and well within a single standard deviation of the original variable’s average. The imputed values look natural if you don’t take the added decimal places into account. That’s easily fixed if necessary.
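If the extra decimals bother you, rounding back to the dataset’s single decimal place is a one-liner:

```python
# Round everything back to one decimal place, matching the original measurements
evaluation = evaluation.round(1)
```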

Parting words

This was a short, simple, and to-the-point article on missing value imputation with machine learning methods. You’ve learned why machine learning-based imputation beats a simple average in this realm and why MissForest outperforms the KNN imputer.

I hope it was a good read for you. Take care.

Join my private email list for more helpful insights.

Dario Radečić
Data scientist, blogger, and enthusiast. Passionate about deep learning, computer vision, and data-driven decision making.
