
This Function Can Make Your Pandas Code Significantly Faster

This article was originally published on Towards Data Science on September 20th, 2020.

Pandas is an awesome library, but it's not the fastest. There are ways of making it faster, though, and the key is using the right tool for the right task. Today we'll explore one of these tools, and it will make everyday tasks significantly faster.

There’s no point in making the introduction section any longer, so let’s take a quick look at the article structure and proceed with the fun stuff. The article is split into several sections:

  • Dataset and problem overview
  • The no-go solution
  • The go-to solution
  • Conclusion

So without further ado, let's get started!

Dataset and problem overview

We'll need two libraries for the demonstration, and those are NumPy and Pandas. The dataset is completely made up and shows sales for 3 products on a particular date. Dates are split into 3 columns, just to make everything a bit harder and heavier for the computer.

Anyway, here’s the dataset:

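The code appeared as an image in the original post, so here's a sketch of how such a dataset might be built. The column names, products, and value ranges are assumptions, not the original code:

```python
import numpy as np
import pandas as pd

# Reproducible, made-up sales data: 100K rows, dates split across
# three columns, three products (names and ranges are assumptions)
np.random.seed(42)
n = 100_000

df = pd.DataFrame({
    'Year': np.random.randint(2010, 2021, size=n),
    'Month': np.random.randint(1, 13, size=n),
    'Day': np.random.randint(1, 29, size=n),
    'Product': np.random.choice(['A', 'B', 'C'], size=n),
    'Amount': np.random.randint(1, 100, size=n),
})

print(df.shape)
print(df.head())
```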

Nothing special here, but we have a decent amount of data — 100K rows. Here's what the first couple of rows look like:


We know how many items were sold on a particular day, but we don’t know the unit price — so let’s declare that quickly:

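The original declaration was shown as an image; a plain dictionary mapping product to unit price would do the job (the prices here are made up):

```python
# Hypothetical unit prices for the three products
prices = {'A': 9.99, 'B': 19.99, 'C': 29.99}
```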

Awesome! Here’s what our end goal is (for every row):

  • Combine Year, Month, and Day into a single variable
  • Calculate the profit by day — by multiplying the unit price with the amount sold
  • Append those two variables as a key-value pair to a list

We also want to do this fast. There are multiple ways we can approach this problem, but only one of them is optimal.

Let’s explore what not to do before we announce the go-to approach for this type of task.

The no-go solution

We have quite a bit of work to do, as discussed in the previous section. That doesn’t mean it should require a lot of time for the computer to finish. You’d be surprised how quickly this can be done.

But first, let's explore the worse of the two options, and that is using the iterrows function to iterate through DataFrame rows and perform calculations.

Here’s the code:

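The code was shown as a screenshot, so here's a sketch of what the iterrows version could look like, following the three goal steps above (the setup block, column names, and prices are assumptions from earlier in the article):

```python
import time

import numpy as np
import pandas as pd

# Minimal setup so the snippet runs on its own (names are assumptions)
np.random.seed(42)
n = 100_000
df = pd.DataFrame({
    'Year': np.random.randint(2010, 2021, size=n),
    'Month': np.random.randint(1, 13, size=n),
    'Day': np.random.randint(1, 29, size=n),
    'Product': np.random.choice(['A', 'B', 'C'], size=n),
    'Amount': np.random.randint(1, 100, size=n),
})
prices = {'A': 9.99, 'B': 19.99, 'C': 29.99}

start = time.time()
results = []
for _, row in df.iterrows():
    # Combine the three date columns into a single string
    date = f"{row['Year']}-{row['Month']:02d}-{row['Day']:02d}"
    # Profit = unit price * amount sold
    profit = prices[row['Product']] * row['Amount']
    # Append as a key-value pair
    results.append({date: profit})

print(f'iterrows took {time.time() - start:.3f} seconds')
```

Your exact timing will depend on your hardware, but iterrows pays a heavy price for constructing a Series object for every single row.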

So, we’ve completed all of the tasks described in the previous sections, and it took almost 8 seconds to complete.

You might think that it isn’t that bad, but just wait until you see what we’ll do next.

The go-to solution

If you think that iterrows is good, wait until you meet itertuples. It's a similar function, used to iterate through DataFrame rows, but it does it so much faster.

We'll perform the same task and compare the execution times. As the name suggests, itertuples works with tuples, so we access the values with attribute (dot) notation instead of brackets. That's the only real difference.

Anyway, here’s the code:

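Again reconstructing from the description rather than the original screenshot, the itertuples version might look like this — same setup assumptions as before, with attribute access replacing bracket notation:

```python
import time

import numpy as np
import pandas as pd

# Minimal setup so the snippet runs on its own (names are assumptions)
np.random.seed(42)
n = 100_000
df = pd.DataFrame({
    'Year': np.random.randint(2010, 2021, size=n),
    'Month': np.random.randint(1, 13, size=n),
    'Day': np.random.randint(1, 29, size=n),
    'Product': np.random.choice(['A', 'B', 'C'], size=n),
    'Amount': np.random.randint(1, 100, size=n),
})
prices = {'A': 9.99, 'B': 19.99, 'C': 29.99}

start = time.time()
results = []
for row in df.itertuples(index=False):
    # Attribute access instead of bracket notation
    date = f'{row.Year}-{row.Month:02d}-{row.Day:02d}'
    profit = prices[row.Product] * row.Amount
    results.append({date: profit})

print(f'itertuples took {time.time() - start:.3f} seconds')
```

itertuples yields lightweight namedtuples instead of building a full Series per row, which is where the speedup comes from.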

This took only 0.218 seconds to complete! That's roughly a 35x reduction in runtime, which is quite significant. Maybe not so much in this toy example, as 8 seconds isn't much time to wait, but the difference scales easily to millions or tens of millions of rows.

Remember — always use itertuples when performing tasks similar to this one.

Before you go

This is the second time I've covered this topic, but this time the dataset and the problem tasks are much heavier — both for the computer to process and for the developer to code out — so I find it worth sharing.

Don't take this 35x speed improvement as something guaranteed, as the results may vary depending on the task and your hardware. Nevertheless, itertuples should outperform iterrows every time. That's what matters.

Join my private email list for more helpful insights.

Dario Radečić
Data scientist, blogger, and enthusiast. Passionate about deep learning, computer vision, and data-driven decision making.
