Missing value imputation is an age-old question in data science and machine learning. Techniques range from simple mean/median imputation to more sophisticated methods based on machine learning. How much of an impact does the choice of approach have on the final results? As it turns out, a lot.
Let’s get a couple of things straight – missing value imputation is domain-specific more often than not. For example, a dataset might contain missing values because a customer isn’t using some service, so imputation would be the wrong thing to do.
Further, simple techniques like mean/median/mode imputation often don’t work well. And it’s easy to reason why. Extremes can influence average values in the dataset, the mean in particular. Also, filling 10% or more of the data with the same value doesn’t sound too peachy, at least for the continuous variables.
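To make the point about extremes concrete, here’s a tiny sketch using only the standard library. The numbers are made up for illustration; the last one plays the role of an outlier.

```python
from statistics import mean, median

# Hypothetical measurements; the last value is an extreme outlier
values = [4.9, 5.1, 5.0, 5.2, 4.8, 50.0]

# The mean gets dragged towards the outlier, the median barely moves
print(f"mean:   {mean(values):.2f}")
print(f"median: {median(values):.2f}")
```

If you filled every gap in this column with the mean, each imputed row would get a value that none of the typical observations come close to.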
The article is structured as follows:
- Problems with KNN imputation
- What is MissForest?
- MissForest in practice
- MissForest evaluation
Problems with KNN imputation
Even some of the machine learning-based imputation techniques have issues. For example, KNN imputation is a great step up from simple average imputation, but it poses a couple of problems:
- You need to choose a value for K, which is easy to experiment with on small datasets but gets harder as the data grows
- It is sensitive to outliers because it uses Euclidean distance under the hood
- It can’t be applied to categorical data directly, as some form of conversion to a numerical representation is required
- It can be computationally expensive, depending on the size of your dataset
Don’t get me wrong, I would pick KNN imputation over a simple average any day, but there are still better methods.
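For reference, here’s roughly what KNN imputation looks like with scikit-learn’s KNNImputer. The toy matrix is hypothetical, and n_neighbors=2 is an arbitrary pick, which is exactly the kind of choice you have to make yourself.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix: rows are samples, np.nan marks missing entries
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, np.nan],
    [6.3, 3.3, 6.0, 2.5],
    [5.8, np.nan, 5.1, 1.9],
    [5.0, 3.4, 1.5, 0.2],
])

# K is a hyperparameter you have to choose; distances are Euclidean,
# computed on the coordinates that both rows have observed
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

Note that every column here is numeric. A categorical column would have to be encoded as numbers before the imputer could use it.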
So, what can you do? MissForest to the rescue.
What is MissForest?
MissForest is a machine learning-based imputation technique that uses a Random Forest algorithm to do the task. It is iterative: it starts from a rough initial guess and re-predicts the missing values on every pass, so the imputations generally improve from one iteration to the next. For more on the theory, look up Andre Ye’s write-up on the algorithm, which has great explanations and beautiful visuals.
This article aims more towards practical application, so we won’t dive too much into the theory. To summarize, MissForest is excellent because:
- It doesn’t require extensive data preparation, as a Random Forest algorithm can determine which features are important
- It doesn’t require tuning anything comparable to K in K-Nearest Neighbors
- It doesn’t care about categorical data types, as Random Forest knows how to handle them
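If you’re curious what the iterative idea looks like in code, here’s a heavily simplified sketch for numeric columns only. It is not the missingpy implementation: the function name missforest_sketch, the fixed number of passes, and the toy data are all my own assumptions, and the real algorithm uses a stopping criterion instead of a fixed loop count.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def missforest_sketch(X, n_iter=5, random_state=0):
    """Simplified MissForest-style loop (numeric data, fixed number of passes)."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    # Initial guess: fill every gap with the observed mean of its column
    X_imp = np.where(mask, np.nanmean(X, axis=0), X)
    # Visit columns with the fewest missing values first, skipping complete ones
    cols = [c for c in np.argsort(mask.sum(axis=0)) if mask[:, c].any()]
    for _ in range(n_iter):
        for c in cols:
            miss = mask[:, c]
            others = np.delete(X_imp, c, axis=1)
            # Train on the rows where column c was actually observed
            rf = RandomForestRegressor(n_estimators=50, random_state=random_state)
            rf.fit(others[~miss], X_imp[~miss, c])
            # Replace the current guesses with fresh predictions
            X_imp[miss, c] = rf.predict(others[miss])
    return X_imp


# Hypothetical toy data: column 2 is almost a copy of column 0
rng = np.random.default_rng(0)
X_full = rng.normal(size=(60, 3))
X_full[:, 2] = X_full[:, 0] + 0.1 * rng.normal(size=60)

X_missing = X_full.copy()
X_missing[rng.choice(60, size=8, replace=False), 2] = np.nan

X_filled = missforest_sketch(X_missing)
print(np.isnan(X_filled).sum())
```

Only the originally missing cells are ever overwritten, so the observed values come out untouched.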
Next, we’ll dive deep into a practical example.
MissForest in practice
We’ll work with the Iris dataset for the practical part. The dataset doesn’t contain any missing values, but that’s the whole point. We will produce missing values randomly, so we can later evaluate the performance of the MissForest algorithm.
Before I forget, please install the required library by executing pip install missingpy from the Terminal.
Great! Next, let’s import NumPy and Pandas and read in the mentioned Iris dataset. We’ll also make a copy of the dataset so that we can evaluate with real values later on:
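Here’s a minimal sketch of that step. As an assumption on my side, it builds the DataFrame from the copy of Iris bundled with scikit-learn rather than from a CSV file, and renames the columns to the snake_case names used in this article. The variable names iris and iris_orig are illustrative.

```python
import numpy as np   # used in the following steps
import pandas as pd  # the loaded table is a pandas DataFrame
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled Iris data instead of a CSV file
iris = load_iris(as_frame=True).frame
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# Keep an untouched copy so we can compare against the real values later
iris_orig = iris.copy()

print(iris.shape)
print(iris.isnull().sum())
```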
All right, let’s now make two lists of unique random numbers ranging from zero to the Iris dataset’s length. With some Pandas manipulation, we’ll replace the values of petal_width with NaNs, based on the index positions generated randomly:
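A sketch of how this could look is below. A few things are assumptions: the text only names petal_width, so I picked sepal_length as the second column, the requested sizes (10 and 15) are arbitrary, and the data again comes from scikit-learn’s bundled copy. No seed is fixed, so the exact number of missing values will vary from run to run.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# randint samples WITH replacement, so duplicates can occur;
# set() drops them, which may leave a list shorter than requested
inds1 = list(set(np.random.randint(0, len(iris), 10)))
inds2 = list(set(np.random.randint(0, len(iris), 15)))

# Assumption: sepal_length is the second column to receive missing values
iris.loc[inds1, "sepal_length"] = np.nan
iris.loc[inds2, "petal_width"] = np.nan

# Number of missing values per column
print(iris.isnull().sum())
```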
As you can see, the petal_width column contains only 14 missing values. That’s because the randomization process created two identical random numbers. It doesn’t pose any problem to us, as in the end, the number of missing values is arbitrary.
The next step is to, well, perform the imputation. We’ll have to remove the target variable from the picture too. Here’s how:
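Here’s a hedged sketch of the imputation step. The article uses the MissForest class from missingpy, which follows the usual scikit-learn fit_transform convention. Since missingpy may not play nicely with recent scikit-learn releases, this sketch swaps in scikit-learn’s IterativeImputer with a RandomForestRegressor. That’s a similar iterative, forest-based scheme, but not the same implementation, so treat the numbers as indicative only. The seed and the 15 masked rows are my own choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, activates IterativeImputer
from sklearn.impute import IterativeImputer

iris = load_iris(as_frame=True).frame
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# Knock out 15 random petal_width values (seeded for repeatability)
rng = np.random.default_rng(42)
iris.loc[rng.choice(len(iris), size=15, replace=False), "petal_width"] = np.nan

# Remove the target variable so it can't leak into the imputation
X = iris.drop(columns="species")

# Stand-in for MissForest: iterative imputation driven by a random forest
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=42),
    max_iter=5,
    random_state=42,
)
X_imputed = imputer.fit_transform(X)  # returns a NumPy array

print(np.isnan(X_imputed).sum())
```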
And that’s it – missing values are now imputed!
But how do we evaluate the damn thing? That’s the question we’ll answer next.
MissForest evaluation
To perform the evaluation, we’ll make use of our copied, untouched dataset. We’ll add two additional columns representing the imputed columns from the MissForest algorithm, one for each column that received missing values.
We’ll then create a new dataset containing only these two columns – in the original and imputed states. Finally, we will calculate the absolute errors for further inspection.
Here’s the code:
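The following is a self-contained sketch of the evaluation, shown for petal_width only to keep it short. As assumptions: the data comes from scikit-learn’s bundled Iris copy, the imputer is scikit-learn’s IterativeImputer with a random forest standing in for missingpy’s MissForest, and names like results and was_missing are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, activates IterativeImputer
from sklearn.impute import IterativeImputer

cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris = load_iris(as_frame=True).frame
iris.columns = cols + ["species"]
iris_orig = iris.copy()  # untouched copy with the real values

# Mask 15 random petal_width values and remember where they were
rng = np.random.default_rng(42)
iris.loc[rng.choice(len(iris), size=15, replace=False), "petal_width"] = np.nan
was_missing = iris["petal_width"].isnull()

# Impute (random-forest-based stand-in for MissForest)
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=42),
    max_iter=5,
    random_state=42,
)
X_imputed = pd.DataFrame(imputer.fit_transform(iris[cols]), columns=cols)

# Original vs. imputed values, plus the absolute error per row
results = pd.DataFrame({
    "petal_width_orig": iris_orig["petal_width"],
    "petal_width_imputed": X_imputed["petal_width"],
})
results["abs_error"] = (results["petal_width_orig"] - results["petal_width_imputed"]).abs()

# Select only the rows on which imputation was performed
print(results[was_missing])
```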
As you can see, the last line of code selects only those rows on which imputation was performed. Let’s take a look:
All absolute errors are small and well within a single standard deviation of the original column’s average. The imputed values look natural if you don’t take into account the added decimal places. That can be easily fixed if necessary.
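As for the extra decimal places, rounding is one simple way to deal with them. The values below are hypothetical, just to show the idea; Iris measurements are recorded to one decimal place, so that’s what the sketch rounds to.

```python
import pandas as pd

# Hypothetical imputed values with long decimal tails
imputed = pd.Series([1.3269, 0.2140, 2.0830], name="petal_width_imputed")

# Round to one decimal place to match the precision of the original column
rounded = imputed.round(1)

print(rounded.tolist())
```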
This was a short, simple, and to-the-point article on missing value imputation with machine learning methods. You’ve learned why machine learning is better than the simple average in this realm and why MissForest outperforms the KNN imputer.
I hope it was a good read for you. Take care.