Data ScienceMachine LearningSQL

How to Calculate Feature Importance with SQL – The Essential Guide

Machine learning isn’t reserved just for Python and R anymore. Much can be done directly in the database with the language everyone knows – SQL. Today we’ll explore how to use it to calculate feature importance, with around five lines of code.

Don’t feel like reading? Check out my video on the topic:

We’ll use Oracle Cloud for this article. It’s free, so please register and create an instance of the OLTP database (Version 19c, has 0.2TB of storage). Once done, establish a connection through the SQL Developer Web — or any other tool.

As for the dataset, we’ll use a Telco industry churn dataset, available for download here. I’ve chosen this dataset because it has many features and doesn’t require any manual preparation.

The article is structured as follows:

What is feature importance? 

Feature importance is a technique that assigns a score to the input features (attributes) based on how useful they are for prediction (for predicting the target variable). 

The concept is essential for predictive modeling because you want to keep only the important features and discard others. By doing so, you reduce both the dimensionality and the noise. It can also lead you to a better understanding of your data, for obvious reasons.

Further, calculating feature importance can provide insights into the workings of a predictive model. By knowing the importance scores, you can immediately tell what the model thinks is the most important, and why it makes predictions the way it does.

Additionally, it’s always good to have a quantitative confirmation, and not work on pure assumptions.

Dataset loading

If you are following along, you should have the dataset downloaded. You’ll have to load it into the database with a tool of your choice – I’m using the SQL Developer Web, but you can use pretty much anything.

The loading process is straightforward – click on the Upload button, choose the dataset, and click Next a couple of times:

Dataset loading

Image 1 – Dataset loading with SQL Developer Web

Mine is stored in the churn table. Let’s take a look at what’s inside by executing a SELECT * FROM churn statement:

Churn dataset head

Image 2 – First 10 rows of the Churn dataset

You can now proceed with the feature importance calculation.

Feature importance with SQL 

As promised, this will take only a couple of lines of code. The feature importance is obtained through the EXPLAIN procedure of the DBMS_PREDICTIVE_ANALYTICS package. It expects three parameters:

  • data_table_name – where the source data is stored
  • explain_column_name – the name of the target variable
  • result_table_name – a new table where feature importances are stored

Here’s how to implement it in code:

And that’s it! You can now take a look at importances with a simple SELECT statement. The one below orders the features, so the most important ones are displayed first:

Here are the results:

Feature importances

Image 3 – Feature importances

From the previous image, you can immediately tell what is and what isn’t important. The next step would be to use this information for a predictive model. That’s a bit out of this article’s scope, but you’d want to proceed with the classification modeling.

As it turns out, this can also be performed only with SQL! Here’s how.

Parting words

I think you didn’t expect that feature importance calculation with SQL was this easy. But it is, just like the rest of in-database machine learning. SQL still isn’t a language for machine learning, but we can say that the future looks promising with these recent advancements. 

Don’t miss out on the rest of machine learning with SQL series:

Feel free to leave your thoughts in the comment section below.

Join my private email list for more helpful insights.

 

Dario Radečić
Data scientist, blogger, and enthusiast. Passionate about deep learning, computer vision, and data-driven decision making.

You may also like

Leave a reply

Your email address will not be published. Required fields are marked *

More in Data Science