When it comes to data science libraries in Python,
Pandas is probably the most useful one. It’s been around for quite some time and includes a lot of time-saving functions. This article covers three lesser-known
Pandas functions in a practical manner.
Don’t feel like reading? Check out my video on the topic:
You’ll need the housing prices dataset to follow along, so please download it from this link. Let’s load it with Pandas:
import pandas as pd

df = pd.read_csv('HousingPrices.csv')
df.head()
Here’s what the dataset looks like:
Without much ado, let’s jump into function number one.
The select_dtypes function selects only the columns of a specific data type. Imagine you want to do some work on strings – you can use this function to make a subset of the non-numeric columns and perform the operations from there.
Our dataset doesn’t contain string columns, as visible from the image below:
We’ll use the select_dtypes function to keep only the integer columns. According to the previous image, only the NumStores column should remain:
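One way to write that call, sketched on a tiny stand-in frame (the values are made up; only the HouseAge and NumStores columns are used here since those are the ones shown above):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the housing data; values are made up for illustration.
# int64 is forced explicitly so the example behaves the same on any platform.
df = pd.DataFrame({
    'HouseAge': np.array([15.0, 25.0, 8.0], dtype='float64'),
    'NumStores': np.array([3, 6, 10], dtype='int64'),
})

int_subset = df.select_dtypes(include='int64')  # keep only the int64 columns
print(int_subset.columns.tolist())  # ['NumStores']
```

On the real dataset, the same call on the loaded df keeps just the NumStores column.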
And here are the results:
Let’s proceed with the next function.
The memory_usage function does just what the name suggests: it reports memory usage. By default, the report is per column and in bytes, but we’ll see how to change that in a bit.
The function is called on the entire dataset:
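For illustration, here’s the call on a tiny stand-in frame (made-up values; each numeric column costs 8 bytes per row, and the index gets its own entry):

```python
import numpy as np
import pandas as pd

# Stand-in frame with made-up values, same column types as the housing data
df = pd.DataFrame({
    'HouseAge': np.array([15.0, 25.0, 8.0], dtype='float64'),
    'NumStores': np.array([3, 6, 10], dtype='int64'),
})

usage = df.memory_usage()  # one entry per column plus the index, in bytes
print(usage)  # 3 rows x 8 bytes = 24 bytes for each data column
```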
The call results in the following series for our dataset:
If you want the total number instead, you can call the
sum() function on the result set:
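Sketched on the same kind of stand-in frame (made-up values), the total is just the sum of the per-column figures:

```python
import numpy as np
import pandas as pd

# Stand-in frame with made-up values
df = pd.DataFrame({
    'HouseAge': np.array([15.0, 25.0, 8.0], dtype='float64'),
    'NumStores': np.array([3, 6, 10], dtype='int64'),
})

total_bytes = df.memory_usage().sum()  # index entry + 24 bytes per data column
print(total_bytes)
```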
Which would result in
23312 bytes. Since bytes aren’t as interpretable as megabytes, let’s make a conversion next:
df.memory_usage().sum() / 1e+6
The above code outputs
0.023312 when executed, which is the housing dataset’s memory footprint in megabytes.
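One caveat worth knowing, illustrated with a made-up frame since our dataset has no string columns: for object (string) columns, the default report counts only the 8-byte pointers, while passing deep=True measures the actual string contents as well:

```python
import pandas as pd

# Illustrative frame with a string column (the housing data has none)
df = pd.DataFrame({'City': ['Taipei', 'Keelung'], 'NumStores': [3, 6]})

shallow = df.memory_usage().sum()        # object column counted as pointers only
deep = df.memory_usage(deep=True).sum()  # also measures the string payloads
print(shallow, deep)
```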
You can take this function a step further with logging to enable memory usage monitoring in production applications, but that’s a bit out of this article’s scope.
Let’s go over the last function.
The query function filters the dataset to grab the records of interest. Its filtering format is what makes it special: there’s no need to juggle dozens of brackets when filtering by multiple conditions, because the function lets you pass the filter condition as a single string.
Let’s take a look at a simple example first – we are interested in houses that are between 20 and 30 years old:
df.query('20 < HouseAge <= 30').head()
Here are the results:
But what if we want to test for multiple conditions? Just put the
and keyword in between:
df.query('20 < HouseAge <= 30 and NumStores == 6').head()
This is the resulting subset:
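A related convenience, sketched here with made-up values: a query string can also reference regular Python variables with the @ prefix, which keeps dynamic filters readable:

```python
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame({'HouseAge': [15.0, 25.0, 35.0], 'NumStores': [6, 6, 1]})

max_age = 30  # a regular Python variable, referenced below via '@'
subset = df.query('HouseAge <= @max_age and NumStores == 6')
print(len(subset))  # 2
```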
And that’s all you need to know. Let’s wrap things up in the next section.
Pandas is among the top data science libraries in Python for a reason – there’s very little you can’t do with it simply and intuitively. Things can get messy with complex filtering, but the newly learned
query function should make things easier.
What are your favorite lesser-known
Pandas functions? Please let me know in the comment section below.
Thanks for reading.