Machine LearningSQL

Machine Learning With SQL - It’s Easier Than You Think

If you’ve been studying data science, it’s likely you know how to perform machine learning tasks in languages like PythonR, and Julia. But what can you do when speed is the key, the hardware is limited, or the company you work for treats SQL as the only option for predictive analytics? In-database machine learning is the answer.

We’ll use Oracle Cloud for this article. It’s free, so please register and create an instance of the OLTP database (Version 19c, has 0.2TB of storage). Once done, download the cloud wallet and establish a connection through SQL Developer — or any other tool.

This will take you 10 minutes at least but is a fairly straightforward thing to do, so I won’t waste time on it.

We’ll use Oracle Machine Learning (OML) to train a classification model on the well-known Iris dataset. I’ve chosen it because it doesn’t require any preparation — we only need to create the table and insert the data.

Let’s do that next.

Data preparation

As mentioned, we need to create a table for holding the Iris dataset, and then we need to load data to it. OML requires one column to be used as row ID (sequence), so let’s keep that in mind:

CREATE TABLE iris_data(
iris_id NUMBER DEFAULT seq_iris.NEXTVAL,
sepal_length NUMBER,
sepal_width NUMBER,
petal_length NUMBER,
petal_width NUMBER,
species VARCHAR2(16)

Awesome! Now we can download the data and load it:

When a modal window pops-up simply provide a path to the downloaded CSV and click Next a couple of times. SQL Developer should get things right without your assistance.

Once done, we have our dataset loaded and prepared:

Let’s continue with the fun part now.

Model training

Now we can get our hands dirty with the fun stuff, and that’s training the classification model. This is broken down into multiple steps, such as train/test split, model training, and model evaluation. Let’s start with the simplest one.

Train/test split

Oracle likes this step done with two views — one for training data and one for testing data. We can easily create those with a bit of PL/SQL magic:

iris_train_data AS
SELECT * FROM iris_data
SAMPLE (75) SEED (42)’;
iris_test_data AS
SELECT * FROM iris_data
SELECT * FROM iris_train_data’;

This script does two things:

  1. Creates a train view — has 75% of data (SAMPLE (75)) split at the random seed 42 ( SEED (42))
  2. Creates a test view — as a difference of the entire dataset and the training view

Our data is stored in views named iris_train_data and iris_test_data — you guess which one holds what.

Let’s quickly check how many rows are in each:

SELECT COUNT(*) FROM iris_train_data;
>>> 111

SELECT COUNT(*) FROM iris_test_data;
>>> 39

We are ready to train the model, so let’s do that next.

Model training

The easiest method for model training is through DBMS_DATA_MINING package, with a single procedure execution, and without the need for creating additional settings tables.

We’ll use the Decision Tree algorithm to train our model. Here’s how:

v_setlst(‘PREP_AUTO’) := ‘ON’;

‘SELECT * FROM iris_train_data’,

The CREATE_MODEL2 procedure (curious why it wasn’t named CREATE_MODEL_FINAL_FINAL89) accepts a lot of parameters. Let’s explain the ones we entered:

  • iris_clf_model — simply the name of your model. Can be anything
  • CLASSIFICATION — type of machine learning task we’re doing. Must be uppercase for some reason
  • SELECT * FROM iris_train_data — specifies where the training data is stored
  • v_setlst — above declared settings list for our model
  • iris_id — name of the sequence type column (each value is unique)
  • species — name of the target variable (what we’re trying to predict)

Executing this block will take a second or two, but once done it’s ready for evaluation!

Model evaluation

Let’s use this script to evaluate our model:


It applies iris_clf_model to the unseen test data iris_test_data and stores evaluation results into a iris_apply_result table. Here’s how this table looks like:

It has many more rows (39 x 3), but you get the point. This still isn’t the most straightforward thing to look at, so let’s show the results in a slightly different way:

CURSOR iris_ids IS
SELECT DISTINCT(iris_id) iris_id
FROM iris_apply_result
ORDER BY iris_id;
curr_y VARCHAR2(16);
curr_yhat VARCHAR2(16);
num_correct INTEGER := 0;
num_total INTEGER := 0;
FOR r_id IN iris_ids LOOP
‘SELECT species FROM
WHERE iris_id = ‘ || r_id.iris_id
INTO curr_y;
‘SELECT prediction
FROM iris_apply_result
WHERE iris_id = ‘ || r_id.iris_id ||
‘AND probability = (
SELECT MAX(probability)
FROM iris_apply_result
WHERE iris_id = ‘ || r_id.iris_id ||
‘)’ INTO curr_yhat;

num_total := num_total + 1;
IF curr_y = curr_yhat THEN
num_correct := num_correct + 1;

DBMS_OUTPUT.PUT_LINE(‘Num. test cases: ‘
|| num_total);
DBMS_OUTPUT.PUT_LINE(‘Num. correct : ‘
|| num_correct);
|| ROUND((num_correct / num_total), 2));

Yes, it’s a lot, but the script above can’t be any simpler. Let’s break it down:

  • CURSOR — gets all distinct iris_ids (because we have them duplicated in iris_apply_results table
  • curr_ycurr_yhatnum_correctnum_total are variables for storing actual species and predicted species at every iteration, number of correct classifications, and total number of test items
  • For every unique iris_id we get the actual species (from iris_test_data, where ids match) and the predicted species (where prediction probability is the highest in iris_apply_results table)
  • Then it’s easy to check if the actual and predicted values are identical — which indicates the classification is correct
  • Variables num_total and num_correct are updated at every iteration
  • Finally, we print the model’s performance to the console

Here’s the output for this script:

Awesome! To interpret:

  • The test set has 39 cases
  • Of these 39, 37 were classified correctly
  • Which results in the 95% accuracy

And that’s pretty much it for the model evaluation.

Before you go

And there you have it — machine learning project written from scratch in SQL. Not all of us have the privilege to work with something like Python on our job, and if a machine learning task comes on your desk you now know how to solve it via SQL.

This was just a simple classification task, of course, and scripts can be improved further, but you get the point. I hope you’ve managed to follow along. For any questions and comments, please refer to the comment section.

Thanks for reading.

Join my private email list for more helpful insights.

Dario Radečić
Data scientist, blogger, and enthusiast. Passionate about deep learning, computer vision, and data-driven decision making.

You may also like


  1. I know thiѕ if off topic but I’m looking into starting my own bⅼog and
    waѕ wonderіng what aⅼl is required to get ѕetup?
    I’m assuming having a blog liкe yours would cost a pretty penny?
    I’m not very internet smart so I’m not 100% sure. Any tips or
    advice would be greatly apprecіated. Thank you

    1. Hi,
      you can always start on places like Medium, but if you want your own website it will cost you around $100 per year. You’ll need to pick a domain (see GoDaddy) and hosting plan (BlueHost is good). There’s plenty of tutorials on YouTube.

  2. […] As it turns out, this can also be performed only with SQL! Here’s how. […]

  3. Hiya, I’m really glad I’ve found this info. Today bloggers publish just about gossips and web and this is really irritating. A good site with exciting content, that’s what I need. Thanks for keeping this website, I will be visiting it. Do you do newsletters? Can not find it.

Leave a reply

Your email address will not be published. Required fields are marked *