PythonSQL

SQL vs. Pandas — Which one to choose in 2020? Part 2

This article was originally published on Towards Data Science on August 21st, 2020.

SQL and Pandas aren’t new technologies. Still, it’s not the easiest task to find corresponding functions for both technologies. That’s where this one and the previous article come into play — providing you with a detailed comparison between the two.

A couple of days back, I’ve covered the first part of this two-part series, dealing with more simple comparisons between the two technologies:

Reading that article first is not a prerequisite, but will definitely help you to get a better understanding of the two. The technologies aren’t designed for the same job, but it’s nice to see corresponding functions between the two. As promised, today we’ll cover more advanced topics:

  • Joins
  • Unions
  • Groupings

Before we do so, let’s start simple with the delete statements.


Delete

DELETE statement is used in SQL to delete or remove a row from the table. The syntax for deleting rows in SQL is as follows:

DELETE FROM table_name
WHERE condition;

Deleting a row is slightly different in Pandas. In Pandas, we do not delete a row, we just select the part that we require and discard the rest of it. Don’t worry if it seems like a riddle to you, the example will illustrate it further.

Let’s say we want to delete all the records from the Asian region.

SQL

DELETE FROM fert_data
WHERE region = ‘Asia’;
Image for post

The rows have been successfully deleted. Now let’s see how to perform this task in Pandas.

Pandas

df = df.loc[df[‘region’] != ‘Asia’]

Here, we have selected all the rows where the region is not ‘Asia’ and then assigned the resultset to our current data frame. That means we have excluded all the rows where the region was ‘Asia’.


Joins

JOINS are used in SQL to join or merge two or more tables together based on a specific condition. There are primarily four types of joins in SQL: LEFTRIGHTINNERFULL. Here is the syntax for JOIN :

SELECT *
FROM table_name_1 as t1 JOIN table_name_2 as t2
ON t1.column_name_1 = t2.column_name_2;

In Pandas, we can join two or more data frames using merge()By default, it will perform an inner join. But you can customize it using the howargument to perform other joins. The basic syntax for pd.merge() is as follows :

merge(left_df, right_df, how=’inner’, on=condition)

Here is an example to illustrate joins.

SQL

Given below is a table called country_sub_region. We have to join this table with fert_data using an inner join.

SELECT country, sub_region
FROM country_sub_region;
Image for post
SELECT * FROM
fert_data as f INNER JOIN country_sub_region as c
ON f.country = c.country;
Image for post

The tables have been successfully joined. Let’s see how to join them in Pandas.

Pandas

Here we have created a data frame similar to the country_sub_region table:

country_sub_reg = data = [
[‘country’, ’subregion’],
[‘Kenya’, ’East Africa’],
[‘Liberia’, ’West Africa’],
[‘Mali’, ’West Africa’]
]df_sr = pd.DataFrame(country_sub_reg[1:],columns=country_sub_reg[0])
Image for post

We will merge df_sr with df on the country field using an inner join:

pd.merge(df, df_sr, on=’country’, how=’inner’)
Image for post

Union

UNION operator is used to club together the results of two or more SELECTstatements in SQL. There is a comrade to the Union operator called UNION ALL. They differ in the sense that the former removes duplicate values from the combined result.

The task of a UNION ALL operator in Pandas can be performed using pd.concat(). While the function of the UNION operator can be performed by first concatenating the data frames using pd.concat() and then applying pd.drop_duplicates() on it.

SQL

In order to illustrate the UNION/UNION ALL operator in SQL, we have created an additional table called fert_data_1. The data in this table looks something as follows:

Image for post

Our task is as follows — find the union of rows from fert_data and fert_data_1 table:

SELECT * FROM fert_data_1
UNION ALL
SELECT * FROM fert_data
ORDER BY country;
Image for post

You will observe that there are some duplicate values. Yes, you guessed it right. You can use the UNION operator to remove them. Try it for yourself.

Pandas

In Pandas, we have created a data frame that is similar to the fert_data_1table in SQL.

data = [
[‘country’, ’region’, ’tfr’, ’contraceptors’],
[‘USA’, ’North.Amer’, 1.77, 20],
[‘UK’, ’Europe’, 1.79, 23],
[‘Bangladesh’, ’Asia’, 5.5, 40],
[‘Thailand’, ’Asia’, 2.3, 68]
]df1 = pd.DataFrame(data[1:], columns=data[0])
Image for post

Union of df and df1:

df_dupli = pd.concat([df1, df])
Image for post

The data from data frames have been combined. But, in this case, we will get duplicate rows as well. For example, the goal is to have ‘Bangladesh’ listed only once:

df_dupli[df_dupli[‘country’] == ’Bangladesh’]
Image for post

We can drop duplicate records using drop_duplicates() as shown:

df_wo_dupli = pd.concat([df1, df]).drop_duplicates()

Let’s run the same query and see if we still get two rows.

df_wo_dupli[df_wo_dupli[‘country’] == ‘Bangladesh’]
Image for post

Problem solved. No more duplicate rows.


Group by

GROUP BY clause in SQL is used to prepare summary rows by grouping records together. The clause is usually used in conjugation with aggregate functions such as AVG, SUM, COUNT, MIN, MAX, etc. Here is the basic syntax for GROUP BY clause:

SELECT column_name_1, agg_func(column_name_2)
FROM table_name
GROUP BY column_name_1;

In Pandas, we have a groupby() function that helps us in summarizing data along a specific column. The generic syntax is as follows:

df.groupby([‘column_name_1’]).agg_function()

Let’s try an example to understand it better — find the average tfr and count of contraceptors field for each region.

SQL

SELECT region, round(avg(tfr),2), count(contraceptors)
FROM fert_data
GROUP BY region;
Image for post

Pandas

df.groupby(‘region’).agg({‘tfr’: np.mean, ‘contraceptors’: np.size}).round(2)
Image for post

We got the same results from both the queries. You must be wondering what that agg() in Pandas is used for. It used to aggregate one or more operations over a specified axis.


Before you go

And this does it — you should now have a good picture behind both technologies, at least data-analysis-wise. It’s difficult to recommend one over the other, as that will depend on your previous experience, biases, and options company you work in opted for.

The good thing is — everything done in SQL can be done in Pandas — at least on this level. Feel free to choose the one you like better, you won’t make a mistake.

Thanks for reading.


Join my private email list for more helpful insights.

Dario Radečić
Data scientist, blogger, and enthusiast. Passionate about deep learning, computer vision, and data-driven decision making.

You may also like

Leave a reply

Your email address will not be published. Required fields are marked *

More in Python