Top 3 Pandas Functions I Wish I Knew Earlier

This article was originally published on Towards Data Science on April 9th, 2020.

Data science is such a broad field that no one can master every language and every library, and even after years in the industry, there's only so much I know. Constant learning is what keeps me in the game, and looking back, knowing the functions from this article earlier would have saved me a lot of time (and nerves).

Some of these are functions in the strict sense, while others refer to the way you use Pandas, and why one approach is better than another.

So let’s start with the first one — it may surprise you how efficient it is.


itertuples()

Wait, what? Yes, this one isn’t a function per se; it refers to a more efficient way of using Pandas, ergo a faster way of iterating through a dataset.

Now, before you give me a hard time in the comment section, I know there are more efficient ways of summing up column values, but I’m doing it this way just to make a point.

We’re gonna declare a simple dataset with only a single column, containing a range of numbers from 1 up to (but not including) 1 million. Here’s how:

import pandas as pd

df = pd.DataFrame(data={
    'Number': range(1, 1000000)
})

Here’s how the first couple of rows look:
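Since the original screenshot isn't reproduced here, a quick self-contained sketch of peeking at those rows:

```python
import pandas as pd

# Rebuild the single-column dataset from the article
df = pd.DataFrame(data={
    'Number': range(1, 1000000)
})

# Show the first five rows: Number values 1 through 5
print(df.head())
```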

Now let’s do things the wrong way. We’re gonna declare a variable total and set it to 0. Then, using iterrows(), we’ll iterate over the dataset and increment total by the value of the current row. We’ll also measure the time. Here’s the code:

%%time

total = 0
for _, row in df.iterrows():
    total += row['Number']
 
total

>>> Wall time: 18.7 s

Almost 19 seconds for this trivial operation — but there’s a better way. Let’s now do the same, but with itertuples() instead of iterrows():

%%time

total = 0
for row in df.itertuples(index=False):
    total += row.Number
 
total

>>> Wall time: 82.1 ms

I won’t do the exact calculations, but going from 18.7 seconds down to 82.1 milliseconds is a speedup of more than 200x. Remember this one next time you write a loop.
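And as hinted earlier, there are even faster options than any explicit loop for this particular task. A minimal sketch of the fully vectorized version:

```python
import pandas as pd

df = pd.DataFrame(data={
    'Number': range(1, 1000000)
})

# The vectorized sum pushes the loop down into optimized C code,
# so it's faster than both iterrows() and itertuples()
total = df['Number'].sum()
print(total)  # 499999500000, the sum of 1 through 999999
```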


nlargest() and nsmallest()

Just yesterday I was computing the distance in kilometers between two latitude/longitude pairs. That was the first part of the problem; the second was selecting the top N records with the smallest distance.

Enter — nsmallest().

As the name suggests, nlargest() will return N largest values, and nsmallest() will do just the opposite.

Let’s see it in action. For the practical part I’ve prepared a small dataset:

df = pd.DataFrame(data={
    'Name': ['Bob', 'Mark', 'Josh', 'Anna', 'Peter', 'Dexter'],
    'Points': [37, 91, 66, 42, 99, 81]
})

And here’s how it looks:

Now let’s say this dataset doesn’t contain 6 rows but 6,000 instead, and you wish to find which students performed best, ergo who had the greatest number of points. One way to do so would be this:

df['Points'].nlargest(3)

Not an optimal solution, because it would result in this:

This is not good because you don’t have a clear picture of the actual names. Here’s how to improve:

df.nlargest(3, columns='Points')

Now the results are much more satisfying:
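The screenshot isn't reproduced here, but a quick sketch of what this call returns:

```python
import pandas as pd

df = pd.DataFrame(data={
    'Name': ['Bob', 'Mark', 'Josh', 'Anna', 'Peter', 'Dexter'],
    'Points': [37, 91, 66, 42, 99, 81]
})

# Top 3 rows by Points, with the Name column kept alongside
top3 = df.nlargest(3, columns='Points')
print(top3)
# Peter (99), Mark (91), Dexter (81)
```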

You can implement almost the same logic to find 3 students who performed the worst — with the nsmallest() function:

df.nsmallest(3, columns='Points')

Here’s the output:
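Again, since the output table isn't shown here, a short sketch of what it contains:

```python
import pandas as pd

df = pd.DataFrame(data={
    'Name': ['Bob', 'Mark', 'Josh', 'Anna', 'Peter', 'Dexter'],
    'Points': [37, 91, 66, 42, 99, 81]
})

# Bottom 3 rows by Points, sorted ascending
bottom3 = df.nsmallest(3, columns='Points')
print(bottom3)
# Bob (37), Anna (42), Josh (66)
```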

And now let’s proceed to the last function.


cut()

To demonstrate the capabilities of this function we’ll be using the dataset from the previous section. To recap, here it is:

df = pd.DataFrame(data={
    'Name': ['Bob', 'Mark', 'Josh', 'Anna', 'Peter', 'Dexter'],
    'Points': [37, 91, 66, 42, 99, 81]
})

The basic idea behind the cut() function is binning values into discrete intervals. Here’s the simplest example: we’ll create two bins from the Points attribute:

pd.cut(df['Points'], bins=2)
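The output isn't shown here, but roughly, cut() splits the range of Points (37 to 99) into two equal-width intervals and assigns each row to one of them. A sketch:

```python
import pandas as pd

df = pd.DataFrame(data={
    'Name': ['Bob', 'Mark', 'Josh', 'Anna', 'Peter', 'Dexter'],
    'Points': [37, 91, 66, 42, 99, 81]
})

# Two equal-width bins spanning min(Points)=37 to max(Points)=99;
# the edges are derived automatically from the data
binned = pd.cut(df['Points'], bins=2)
print(binned.value_counts())
# Each of the two auto-generated intervals holds 3 of the 6 students
```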

Not too useful on its own. But how about declaring the first bin to go from 0 to 50, and the second one from 50 to 100? Sounds like a plan. Here’s the code:

pd.cut(df['Points'], bins=[0, 50, 100])

But still, let’s say that you want to display Fail instead of (0, 50] and Pass instead of (50, 100]. Here’s how to do so:

pd.cut(df['Points'], bins=[0, 50, 100], labels=['Fail', 'Pass'])
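A common follow-up is storing those labels back in the DataFrame; the `Result` column name below is my own choice, not something from the article:

```python
import pandas as pd

df = pd.DataFrame(data={
    'Name': ['Bob', 'Mark', 'Josh', 'Anna', 'Peter', 'Dexter'],
    'Points': [37, 91, 66, 42, 99, 81]
})

# Store the Pass/Fail labels as a new column (column name is arbitrary)
df['Result'] = pd.cut(df['Points'], bins=[0, 50, 100], labels=['Fail', 'Pass'])
print(df)
# Bob (37) and Anna (42) get 'Fail'; everyone else gets 'Pass'
```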

Now that’s something.


Before you go

If you are just starting out, these functions will help you save both time and nerves. If you’re not, reading this article will help reinforce your knowledge that these functions exist — because it’s easy to forget about them and write the logic from scratch.

But there’s no point in doing so.

I hope you’ve liked it. Thanks for reading.

Dario Radečić
Data scientist, blogger, and enthusiast. Passionate about deep learning, computer vision, and data-driven decision making.
