Data ScienceSQL

Top 5 SQL Analytic Functions Every Data Analyst Needs to Know

This article was originally published on Towards Data Science on August 27th, 2020.

Analytical functions are one of the most popular tools among BI/Data analysts for performing complex data analysis. These functions perform computations over multiple rows and return the multiple rows as well. Today we’ll cover 5 functions I find most useful, with a lot of practical examples.

For the uninitiated, the relative size of a query with an analytical function might seem a bit intimidating. Don’t worry, we have you covered. Most of these functions follow a basic syntax:

analytic_function_name([argument_list])
OVER (
[PARTITION BY partition_expression,…]
[ORDER BY sort_expression, … [ASC|DESC]])

There are three parts to this syntax, namely functionpartition by and order by. Let’s briefly cover what each one does:

  • analytic_function_name: name of the function — like RANK()SUM()FIRST(), etc
  • partition_expression: column/expression on the basis of which the partition or window frames have to be created
  • sort_expression: column/expression on the basis of which the rows in the partition will be sorted

Okay, we’ve covered the basics thus far. For the practical part we’re gonna use the Orders table stored inside the PostgreSQL database:

Image for post

Let’s begin with the practical part now, shall we?


AVG() and SUM()

We’ve all been using aggregate functions such as SUMAVGMINMAX, and COUNT in our GROUP BY clauses. But when these functions are used over an ORDER BY clause they can give us running sum, mean, total, etc.

The following example will make it a lot more clear — we want to calculate the running average revenue and total revenue for each agent in the third quarter:

SELECT ord_date, agent_code, AVG(ord_amount) OVER (
PARTITION BY agent_code
ORDER BY ord_date
) running_agent_avg_revenue,
SUM (ord_amount) OVER (
PARTITION BY agent_code
ORDER BY ord_date
) running_agent_total_revenue
FROM orders
WHERE ord_date BETWEEN ‘2008–07–01’ AND ‘2008–09–30’;

And here are the results:

Image for post

Awesome! These functions are simple and require no additional explanations. Let’s proceed.


FIRST_VALUE(), LAST_VALUE(), and NTH_VALUE()

FIRST_VALUE() is an analytical function that returns the value of the specified column from the first row of the window frame. If you‘ve understood the previous sentence, LAST_VALUE() is self-explanatory. It fetches the value from the last row.

PostgreSQL provides us with one more additional function called NTH_VALUE(column_name, n) that fetches the value from the n-th row. Isn’t it great? No more complex self joins.

Let’s answer the following question — How many days after the first purchase of a customer was the next purchase made?

SELECT cust_code, ord_date, ord_date — FIRST_VALUE(ord_date) OVER (
PARTITION BY cust_code
ORDER BY ord_date) next_order_gap
FROM orders
ORDER BY cust_code, next_order_gap;

And here are the results:

Image for post

There are just so many occasions where this function might be useful. Also, it’s nice to know the results can be obtained from the database directly, so we don’t have to do this calculation manually with Python/R.

Let’s proceed with the next one.


LEAD() and LAG()

LEAD() function, as the name suggests, fetches the value of a specific column from the next row and returns the fetched value in the current row. In PostgreSQLLEAD() takes two arguments:

  • column_name from which the next value has to be fetched
  • index of the next row relative to the current row.

LAG() is just the opposite of. It fetches values from the previous rows.

Let’s answer the following question to make this concept a bit more clear — what is the last highest amount for which an order was sold by an agent?

SELECT agent_code, ord_amount, LAG(ord_amount, 1) OVER (
PARTITION BY agent_code
ORDER BY ord_amount DESC
) last_highest_amount
FROM orders
ORDER BY agent_code, ord_amount DESC;

And here are the results:

Image for post

Above you can see how last_highest_amount clearly shows data per agent — that’s why there’s no result for agent A001, and first values for other agents are NULL.


RANK() and DENSE_RANK()

RANK() and DENSE_RANK() are numbering functions. They assign an integer value to a row depending upon the partition and the ordering. I cannot stress enough on the importance of these functions when it comes to finding the nth highest/lowest record from the table.

DENSE_RANK() and RANK() differ on the point that in the former we get consecutive ranks while in the later the rank after a tie is skipped. For example, ranking using DENSE_RANK() would be something like (1,2,2,3) whereas ranking using RANK() would be (1,2,2,4). Hope you get the difference.

Anyhow, let’s answer the following question with the help of these functions — what are the second highest order values for each month?

SELECT * FROM (
SELECT ord_num, ord_date, ord_amount, DENSE_RANK() OVER(
PARTITION BY DATE_PART(‘month’, ord_date)
ORDER BY ord_amount DESC) order_rank
FROM orders
) t
WHERE order_rank = 2
ORDER BY ord_date;

And here are the results:

Image for post

Cool! Let’s proceed with the next one.


CUME_DIST()

CUME_DIST() function is used to calculate the cumulative distribution of values within a given partition. It computes the fraction of rows in the partition that is less than or equal to the current row. It’s very helpful when we have to fetch only the top n% of the results.

Let’s use it to calculate the revenue percentile for each order in August and September:

SELECT DATE_PART(‘Month’,ord_date), agent_code, ord_amount, CUME_DIST() OVER(
PARTITION BY DATE_PART(‘Month’,ord_date)
ORDER BY ord_amount
)
FROM orders
WHERE ord_date BETWEEN ‘2008–08–01’ AND ‘2008–09–30’;

And here are the results:

Image for post

It’s not the function I use on a daily basis, but it’s nice to know it exists.


Before you go

And there you have it — 5 most common analytical functions I use when performing analysis in the database. It’s not as common for me as doing analysis with Python and Pandas, but I still find this useful from time to time — especially for analysts limited only to SQL.

I hope this 5 will suit you well, and feel free to research and study more on your own. Thanks for reading.


Join my private email list for more helpful insights.

Dario Radečić
Data scientist, blogger, and enthusiast. Passionate about deep learning, computer vision, and data-driven decision making.

You may also like

Leave a reply

Your email address will not be published. Required fields are marked *

More in Data Science