This article was originally published on Towards Data Science on August 27th, 2020.

Analytical functions are one of the most popular tools among BI/data analysts for performing complex data analysis. These functions perform computations across multiple rows and return a result for each row. Today we’ll cover the 5 functions I find most useful, with a lot of practical examples.

For the uninitiated, the relative size of a query with an analytical function might seem a bit intimidating. Don’t worry, we have you covered. Most of these functions follow a basic syntax:

```
analytic_function_name([argument_list])
OVER (
    [PARTITION BY partition_expression, ...]
    [ORDER BY sort_expression [ASC | DESC], ...]
)
```

There are three parts to this syntax, namely *function*, *partition by* and *order by*. Let’s briefly cover what each one does:

- `analytic_function_name`: the name of the function, like `RANK()`, `SUM()`, `FIRST_VALUE()`, etc.
- `partition_expression`: the column/expression on the basis of which the partitions (window frames) are created
- `sort_expression`: the column/expression on the basis of which the rows in the partition are sorted

Okay, we’ve covered the basics thus far. For the practical part we’re gonna use the Orders table stored inside the *PostgreSQL* database:

Let’s begin with the practical part now, shall we?

### AVG() and SUM()

We’ve all been using aggregate functions such as `SUM`, `AVG`, `MIN`, `MAX`, and `COUNT` with a `GROUP BY` clause. But when these same functions are used in an `OVER` clause with `ORDER BY`, they give us a running sum, running average, and so on.

The following example will make it a lot more clear — *we want to calculate the running average revenue and total revenue for each agent in the third quarter:*

```sql
SELECT
    ord_date,
    agent_code,
    AVG(ord_amount) OVER (
        PARTITION BY agent_code
        ORDER BY ord_date
    ) AS running_agent_avg_revenue,
    SUM(ord_amount) OVER (
        PARTITION BY agent_code
        ORDER BY ord_date
    ) AS running_agent_total_revenue
FROM orders
WHERE ord_date BETWEEN '2008-07-01' AND '2008-09-30';
```

And here are the results:

Awesome! These functions are simple and require no additional explanations. Let’s proceed.
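If you want to experiment without a PostgreSQL instance, SQLite (3.25+) supports the same window functions, so here’s a minimal runnable sketch using Python’s built-in `sqlite3` module. The three rows of toy data are made up for illustration, not taken from the article’s orders table:

```python
import sqlite3

# Hypothetical toy data, not the article's orders table.
# Window functions require SQLite 3.25+ (bundled with modern Python builds).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (ord_date TEXT, agent_code TEXT, ord_amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("2008-07-01", "A001", 100.0),
     ("2008-07-15", "A001", 200.0),
     ("2008-08-01", "A001", 300.0)],
)

# Running average and running total per agent, ordered by date
rows = con.execute("""
    SELECT ord_date,
           AVG(ord_amount) OVER (PARTITION BY agent_code ORDER BY ord_date) AS running_avg,
           SUM(ord_amount) OVER (PARTITION BY agent_code ORDER BY ord_date) AS running_total
    FROM orders
""").fetchall()

print(rows)
```

Each row carries the average and sum of everything up to and including itself, which is exactly what the `ORDER BY` inside `OVER` buys you over a plain `GROUP BY`.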

### FIRST_VALUE(), LAST_VALUE(), and NTH_VALUE()

`FIRST_VALUE()` is an analytical function that returns the value of the specified column from the first row of the window frame. If you’ve understood the previous sentence, `LAST_VALUE()` is self-explanatory: it fetches the value from the last row.

*PostgreSQL* provides us with one more function, `NTH_VALUE(column_name, n)`, which fetches the value from the n-th row. Isn’t it great? No more complex self-joins.

Let’s answer the following question: *How many days after a customer’s first purchase was each subsequent purchase made?*

```sql
SELECT
    cust_code,
    ord_date,
    ord_date - FIRST_VALUE(ord_date) OVER (
        PARTITION BY cust_code
        ORDER BY ord_date
    ) AS next_order_gap
FROM orders
ORDER BY cust_code, next_order_gap;
```

And here are the results:

There are just so many occasions where this function might be useful. Also, it’s nice to know the results can be obtained from the database directly, so we don’t have to do this calculation manually with Python/R.
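The same idea can be reproduced in SQLite with made-up data, again via Python’s `sqlite3`. One caveat: SQLite stores dates as text, so `julianday()` stands in for the plain date subtraction PostgreSQL allows:

```python
import sqlite3

# Hypothetical toy data: one customer, three orders
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (cust_code TEXT, ord_date TEXT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("C001", "2008-07-01"), ("C001", "2008-07-11"), ("C001", "2008-08-01")],
)

# julianday() turns the text dates into day numbers so the subtraction works
rows = con.execute("""
    SELECT cust_code, ord_date,
           CAST(julianday(ord_date) - julianday(
               FIRST_VALUE(ord_date) OVER (PARTITION BY cust_code ORDER BY ord_date)
           ) AS INTEGER) AS order_gap
    FROM orders
    ORDER BY cust_code, order_gap
""").fetchall()

print(rows)
```

The first order of each customer gets a gap of 0, and every later order reports its distance in days from that first purchase.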

Let’s proceed with the next one.

### LEAD() and LAG()

The `LEAD()` function, as the name suggests, fetches the value of a specific column from the next row and returns it in the current row. In *PostgreSQL*, `LEAD()` takes two arguments:

- `column_name`: the column from which the next value has to be fetched
- `offset`: how many rows ahead of the current row to look

`LAG()` is just the opposite of `LEAD()`: it fetches values from the previous rows.

Let’s answer the following question to make this concept a bit more clear — *what is the last highest amount for which an order was sold by an agent?*

```sql
SELECT
    agent_code,
    ord_amount,
    LAG(ord_amount, 1) OVER (
        PARTITION BY agent_code
        ORDER BY ord_amount DESC
    ) AS last_highest_amount
FROM orders
ORDER BY agent_code, ord_amount DESC;
```

And here are the results:

Above you can see how `last_highest_amount` is computed per agent — that’s why there’s no result for agent A001, and the first value for every other agent is NULL.
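Here’s the same `LAG()` behavior on a toy SQLite table (hypothetical numbers, single agent) so you can see the NULL on the first row of the partition:

```python
import sqlite3

# Hypothetical toy data: three order amounts for one agent
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (agent_code TEXT, ord_amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("A002", 500.0), ("A002", 300.0), ("A002", 800.0)],
)

# Each row gets the previous (i.e. next-highest) amount in its partition
rows = con.execute("""
    SELECT agent_code, ord_amount,
           LAG(ord_amount, 1) OVER (
               PARTITION BY agent_code
               ORDER BY ord_amount DESC
           ) AS last_highest_amount
    FROM orders
    ORDER BY agent_code, ord_amount DESC
""").fetchall()

print(rows)
```

The top amount in the partition has nothing before it, so `LAG()` returns NULL (`None` in Python) there.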

### RANK() and DENSE_RANK()

`RANK()` and `DENSE_RANK()` are numbering functions. They assign an integer value to each row depending on the partition and the ordering. I cannot stress enough how important these functions are when it comes to finding the n-th highest/lowest record in a table.

`DENSE_RANK()` and `RANK()` differ in that the former produces consecutive ranks, while the latter skips ranks after a tie. For example, ranking with `DENSE_RANK()` would look like (1, 2, 2, 3), whereas ranking with `RANK()` would be (1, 2, 2, 4). Hope you get the difference.
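A quick way to see the difference side by side is to run both functions over a tiny table with one tie. This toy SQLite snippet (made-up values) reproduces exactly the (1, 2, 2, 3) vs. (1, 2, 2, 4) pattern:

```python
import sqlite3

# Hypothetical toy data with a tie on 20
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (amount INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(10,), (20,), (20,), (30,)])

rows = con.execute("""
    SELECT amount,
           RANK()       OVER (ORDER BY amount) AS rnk,
           DENSE_RANK() OVER (ORDER BY amount) AS dense_rnk
    FROM t
    ORDER BY amount
""").fetchall()

print(rows)
```

After the two rows tied at rank 2, `RANK()` jumps straight to 4 while `DENSE_RANK()` continues with 3.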

Anyhow, let’s answer the following question with the help of these functions — *what are the second highest order values for each month?*

```sql
SELECT * FROM (
    SELECT
        ord_num,
        ord_date,
        ord_amount,
        DENSE_RANK() OVER (
            PARTITION BY DATE_PART('month', ord_date)
            ORDER BY ord_amount DESC
        ) AS order_rank
    FROM orders
) t
WHERE order_rank = 2
ORDER BY ord_date;
```

And here are the results:

Cool! Let’s proceed with the next one.

### CUME_DIST()

The `CUME_DIST()` function calculates the cumulative distribution of values within a given partition. For each row, it computes the fraction of rows in the partition that are less than or equal to the current row. It’s very helpful when we have to fetch only the top n% of the results.

Let’s use it to *calculate the revenue percentile for each order in August and September*:

```sql
SELECT
    DATE_PART('month', ord_date),
    agent_code,
    ord_amount,
    CUME_DIST() OVER (
        PARTITION BY DATE_PART('month', ord_date)
        ORDER BY ord_amount
    )
FROM orders
WHERE ord_date BETWEEN '2008-08-01' AND '2008-09-30';
```

And here are the results:

It’s not the function I use on a daily basis, but it’s nice to know it exists.
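For a feel of what the fractions look like, here’s `CUME_DIST()` over four toy amounts in SQLite (hypothetical values, single partition):

```python
import sqlite3

# Hypothetical toy data: four distinct order amounts
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (ord_amount REAL)")
con.executemany("INSERT INTO orders VALUES (?)",
                [(100.0,), (200.0,), (300.0,), (400.0,)])

# Fraction of rows <= the current row
rows = con.execute("""
    SELECT ord_amount,
           CUME_DIST() OVER (ORDER BY ord_amount) AS pct
    FROM orders
    ORDER BY ord_amount
""").fetchall()

print(rows)
```

With four distinct values, the fractions land on 0.25, 0.5, 0.75, and 1.0; filtering on `pct >= 0.75` would keep only the top 25% of orders.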

### Before you go

And there you have it: the 5 most common analytical functions I use when performing analysis in the database. It’s not as common for me as doing analysis with Python and Pandas, but I still find it useful from time to time, especially for analysts limited to SQL only.

I hope these 5 will suit you well, and feel free to research and study more on your own. Thanks for reading.

**Join my private email list for more helpful insights.**