Data PreparationData Science

Top 3 Lesser-Known Pandas Functions

When it comes to data science libraries in Python, Pandas is probably the most useful one. It’s been around for quite some time and includes a lot of time-saving functions. This article covers three lesser-known Pandas functions in a practical manner.

Don’t feel like reading? Check out my video on the topic:

You’ll need a Housing prices dataset to follow along, so please download it from this link. Let’s load it in with Pandas:

import pandas as pd

df = pd.read_csv('HousingPrices.csv')
df.head()

Here’s how the dataset looks like:

Housing prices dataset head

Image 1 – Housing prices dataset head (image by author)

Without much ado, let’s jump into function number one.

select_dtypes()

The select_dtypes function is used to select only the columns of a specific data type. Just imagine you want to do some work on strings – you can use the mentioned function to make a subset of non-numeric columns and perform the operations from there.

Our dataset doesn’t contain string columns, as visible from the image below:

Datatypes per column

Image 2 – Datatypes per column (image by author)

We’ll use the select_dtypes function to keep only integer columns. According to the previous image, only the TransactionDate and NumStores columns should remain:

df.select_dtypes('int64').head()

And here are the results:

Integer columns

Image 3 – Integer columns (image by author)

Let’s proceed with the next function.

memory_usage()

The memory_usage function does just what the name suggests – reports back the memory usage. By default, the reporting is done per column and in bytes, but we’ll see how to change it in a bit.

The function is called on the entire dataset:

df.memory_usage()

The call results in the following series for our dataset:

Memory usage per column

Image 4 – Memory usage in bytes per column (image by author)

If you want the total number instead, you can call the sum() function on the result set:

df.memory_usage().sum()

Which would result in 23312 bytes. Since bytes aren’t as interpretable as megabytes, let’s make a conversion next:

df.memory_usage().sum() / 1e+6

The above code outputs 0.023312 when executed, representing how many megabytes of memory the housing dataset takes.

You can take this function a step further with logging to enable memory usage monitoring in production applications, but that’s a bit out of this article’s scope. 

Let’s go over the last function.

query()

The query function is used to filter the dataset and to grab records of interest. The filtering format is what makes it special. There’s no need to manage tens of brackets if you are filtering by multiple conditions, as this function enables you to pass filter condition as a string.

Let’s take a look at a simpler example first – we are interested in houses that are between 20 and 30 years old:

df.query('20 < HouseAge <= 30').head()

Here are the results:

Filtered houses 1

Image 5 – Houses between 20 and 30 years old (image by author)

But what if we want to test for multiple conditions? Just put the and or or keywords in between:

df.query('20 < HouseAge <= 30 and NumStores == 6').head()

This is the resulting subset:

Filtered houses 2

Image 6 – Houses between 20 and 30 years old with 6 stores nearby (image by author)

And that’s all you need to know. Let’s wrap things up in the next section.

Parting words

Pandas is among the top data science libraries in Python for a reason – there’s nothing you can’t do simply and intuitively. Things can get messy with complex filtering, but the newly learned query functions should make things easier.

What are your favorite lesser-known Pandas functions? Please let me know in the comment section below.

Thanks for reading.

Join my private email list for more helpful insights.

 

Dario Radečić
Data scientist, blogger, and enthusiast. Passionate about deep learning, computer vision, and data-driven decision making.

You may also like

Leave a reply

Your email address will not be published. Required fields are marked *