Data ScienceSQL

Let’s Impute Missing Values with SQL

This article was originally published on Towards Data Science on August 25th, 2020.

Missing values are a huge problem in machine learning. In a day and age when machine learning can be done directly in the database, one wonders how to perform adequate data preparation with SQL, without other programming languages, such as Python and R. Today we’ll see just how easy it is.

We’ll use Oracle Cloud for the purpose of this article, as it’s free and can be used without any downloads and installations on your machine — through the SQL Developer Web. If you decide to follow along, create a free OLTP database, and go to Service Console — Development — SQL Developer Web.

With regards to the dataset, we’ll use the well-known Titanic dataset for two reasons:

  • It’s simple and easy to understand
  • It contains enough missing values for us to play with

Once you have the dataset downloaded, you can use the Upload Data functionality of SQL Developer Web to create the table and upload data:

Image for post

Change data types using your best judgment and you’re ready to roll!


Preparation and exploration

I don’t want to mess anything up with the source table, called titanic, so let’s make a copy of it:

CREATE TABLE cp_titanic AS 
SELECT * FROM titanic;

Let’s just make a quick Select to verify everything is as it should be:

SELECT * FROM cp_titanic;
Image for post

Awesome! If you’ve done any work with this dataset, you know the Agecolumn is the most problematic one when it comes to missing values. Let’s just check how many there are:

SELECT COUNT(*) AS num_age_missing 
FROM cp_titanic
WHERE age IS NULL;
Image for post

Now that’s a lot if we take into account that the dataset has less than 1000 rows. For that reason, we need to figure out how to fill these missing values and do so in an easy and automated fashion.


Imputation time

To handle missing data imputation, we’ll create a PL/SQL procedure. Don’t worry if you don’t know what that is, as it will be utterly simple to follow along. Before we do anything in code, let’s list things this procedure should do:

  1. Obtain a value used for imputation (mean/median/mode will do the trick for this simple example)
  2. Make an update to the table — replace NULL values with the calculated value

Great! This shouldn’t be hard to do. We’ll also be accepting some input from the user — 3 parameters to be more precise:

  • Table name — text representation of a table where missing data is located
  • Attribute — a column which contains missing values
  • Impute method — a way on which imputation is done — either mean, median, or mode

And that’s all we have to know to get started. Let’s create a procedure with what we know so far:

CREATE OR REPLACE PROCEDURE impute_missing(
in_table_name IN VARCHAR2,
in_attribute IN VARCHAR2,
in_impute_method IN VARCHAR2 DEFAULT ‘mean’
) IS
BEGIN
END;
/

Awesome! Below the IS keyword we’ll declare some other variables — for holding things such as impute function that will be used, a statement for getting the average value, the average value itself, and an update statement. Our procedure now looks as follows:

CREATE OR REPLACE PROCEDURE impute_missing(
in_table_name IN VARCHAR2,
in_attribute IN VARCHAR2,
in_impute_method IN VARCHAR2 DEFAULT ‘mean’
) IS
impute_func VARCHAR2(16);
get_avg_stmt VARCHAR2(128);
avg_value NUMBER;
update_stmt VARCHAR2(128);
BEGIN
END;
/

The remaining of the code will be located between the BEGIN and ENDkeywords.

Impute method

As discussed earlier, our procedure can handle missing value imputation by using mean, median, or mode statistical functions. Also, those are values that the user can provide for the in_impute_method parameter.

The only problem is — these statistical functions are called a bit differently in SQL. We can use the CASE operator to handle this and store the result into impute_func variable. Here’s the code:

CASE in_impute_method
WHEN ‘mean’ THEN impute_func := ‘AVG’;
WHEN ‘median’ THEN impute_func := ‘MEDIAN’;
WHEN ‘mode’ THEN impute_func := ‘STATS_MODE’;
ELSE RAISE_APPLICATION_ERROR(-20001, ‘Invalid impute method!’);
END CASE;

That wasn’t hard, was it?

We’re almost halfway done, and the following parts are also plain simple.

Get the average value

To get the average value, we’ll have to use something known as a Dynamic SQL. This means we won’t hardcode the SQL statements, but instead, the statements get created based on provided user input.

Let’s see how to create a Dynamic query for getting the average value, with respect to parameters the user has entered (table name, attribute, and impute method):

get_avg_stmt :=
q’[SELECT ]’
|| impute_func
|| q’[(]’
|| in_attribute
|| q’[) FROM ]’
|| in_table_name;DBMS_OUTPUT.PUT_LINE(‘get_avg_stmt = ‘ || get_avg_stmt);

It might look and feel a bit strange until you get used to this, but you’ll see later what it all boils down to. This DBMS_OUTPUT line isn’t required, but it prints the get_avg_stmt to the console once we run the procedure.

This isn’t enough though, as we still need to execute this statement and store its result in the avg_value variable. It’s simple to do:

BEGIN
EXECUTE IMMEDIATE get_avg_stmt INTO avg_value;
END;DBMS_OUTPUT.PUT_LINE(‘avg_value = ‘ || avg_value);

This part is now done, and now we should somehow make an update to the table to actually fill the missing values. Let’s see how.

Imputation

If you’ve understood the previous section, you’ll understand this one. Once again, we need to create a SQL statement dynamically and execute it. The only difference is that this time results won’t be stored in a variable, as that makes no sense for an UPDATE statement. The table is modified in-place, instead:

update_stmt :=
q’[UPDATE ]’
|| in_table_name
|| q’[ SET ]’
|| in_attribute
|| q’[ = ]’
|| avg_value
|| q’[ WHERE ]’
|| in_attribute
|| q’[ IS NULL]’;DBMS_OUTPUT.PUT_LINE(‘update_stmt = ‘ || update_stmt);BEGIN
EXECUTE IMMEDIATE update_stmt;
END;COMMIT;

The only new thing here is the COMMIT keyword. It is used to end the current transaction and make all of the changes permanent. If you’re not sure which statements have to be committed, here’s an explanation. You’re welcome.

And that’s it basically, we’ve done everything we had to.

Recap

Here’s the entire procedure, just in case you’ve missed something:

CREATE OR REPLACE PROCEDURE impute_missing(
in_table_name IN VARCHAR2,
in_attribute IN VARCHAR2,
in_impute_method IN VARCHAR2 DEFAULT ‘mean’
) IS
impute_func VARCHAR2(16);
get_avg_stmt VARCHAR2(128);
avg_value NUMBER;
update_stmt VARCHAR2(128);
BEGIN
CASE in_impute_method
WHEN ‘mean’ THEN impute_func := ‘AVG’;
WHEN ‘median’ THEN impute_func := ‘MEDIAN’;
WHEN ‘mode’ THEN impute_func := ‘STATS_MODE’;
ELSE RAISE_APPLICATION_ERROR(-20001, ‘Invalid impute method!’);
END CASE; get_avg_stmt :=
q’[SELECT ]’
|| impute_func
|| q’[(]’
|| in_attribute
|| q’[) FROM ]’
|| in_table_name;
DBMS_OUTPUT.PUT_LINE(‘get_avg_stmt = ‘ || get_avg_stmt); BEGIN EXECUTE IMMEDIATE get_avg_stmt INTO avg_value;
END;
DBMS_OUTPUT.PUT_LINE(‘avg_value = ‘ || avg_value); update_stmt :=
q’[UPDATE ]’
|| in_table_name
|| q’[ SET ]’
|| in_attribute
|| q’[ = ]’
|| avg_value
|| q’[ WHERE ]’
|| in_attribute
|| q’[ IS NULL]’;
DBMS_OUTPUT.PUT_LINE(‘update_stmt = ‘ || update_stmt); BEGIN EXECUTE IMMEDIATE update_stmt;
END;
COMMIT;END;
/

And now, let’s test the thing!


Testing

We’ve done all the hard work, and now we can test if everything works as advertised. To do so, we’ll impute the missing values of age column with its median value. Here’s how:

BEGIN
impute_missing(‘cp_titanic’, ‘age’, ‘median’);
END;
/

And that’s it! If we execute this block of code, we’ll get some output to the console, due to all DBMS_OUTPUT calls in the procedure:

Image for post

Awesome! If we once again check for the number of missing values, we can see that everything worked out as it should:

SELECT COUNT(*) AS num_age_missing 
FROM cp_titanic 
WHERE age IS NULL;
Image for post

What can be improved?

Our procedure does the job, but that doesn’t mean it can’t be improved further. Here are a couple of ideas:

  • Round the result — calling AVG on pretty much anything will result in a lot of decimal spaces
  • Have the option for the user to enter the value for imputation — instead of using mean/median/mode

This is not a definite list, so feel free to tailor the procedure to your needs.


Join my private email list for more helpful insights.

Dario Radečić
Data scientist, blogger, and enthusiast. Passionate about deep learning, computer vision, and data-driven decision making.

You may also like

Leave a reply

Your email address will not be published. Required fields are marked *

More in Data Science