This article was originally published on Towards Data Science on September 20th, 2020.
Pandas is an awesome library, but it’s not the fastest. There are ways to make it faster, though, and one of them is using the right tool for the right task. Today we’ll explore one of these tools, and it will make everyday tasks significantly faster.
There’s no point in making the introduction section any longer, so let’s take a quick look at the article structure and proceed with the fun stuff. The article is split into several sections:
- Dataset and problem overview
- The no-go solution
- The go-to solution
So without further ado, let’s get started!
Dataset and problem overview
We’ll need two libraries for the demonstration: NumPy and Pandas. The dataset is completely made up and shows sales for 3 products on particular dates. Dates are split into 3 columns, just to make everything a bit harder and heavier for the computer.
Anyway, here’s the dataset:
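The original code isn’t reproduced here, so the following is a minimal sketch that builds an equivalent dataset. The column names (`Year`, `Month`, `Day`, `Product`, `Sold`) and the product labels are assumptions, not the author’s originals:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 100_000  # 100K rows, as described in the article

# Made-up sales data; the date is deliberately split into three columns
df = pd.DataFrame({
    'Year': np.random.randint(2010, 2021, size=n),
    'Month': np.random.randint(1, 13, size=n),
    'Day': np.random.randint(1, 29, size=n),
    'Product': np.random.choice(['A', 'B', 'C'], size=n),
    'Sold': np.random.randint(0, 100, size=n),
})
```

Calling `df.head()` afterwards shows the first couple of rows.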
Nothing special here, but we have a decent amount of data — 100K rows. Here’s what the first couple of rows look like:
We know how many items were sold on a particular day, but we don’t know the unit price — so let’s declare that quickly:
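The original declaration isn’t shown; a minimal sketch, assuming three products and made-up unit prices (both the dictionary name and the values are hypothetical):

```python
# Hypothetical unit price per product (values are made up)
unit_prices = {'A': 19.99, 'B': 24.99, 'C': 29.99}
```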
Awesome! Here’s what our end goal is (for every row):
- Merge the Year, Month, and Day columns into a single date variable
- Calculate the profit by day — by multiplying the unit price with the amount sold
- Append those two variables as a key-value pair to a list
We also want to do this fast. There are multiple ways we can approach this problem, but only one of them is optimal.
Let’s explore what not to do before we announce the go-to approach for this type of task.
The no-go solution
We have quite a bit of work to do, as discussed in the previous section. That doesn’t mean it should require a lot of time for the computer to finish. You’d be surprised how quickly this can be done.
But first, let’s explore the worse of the two options: using the `iterrows` function to iterate through DataFrame rows and perform the calculations.
Here’s the code:
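The original screenshot isn’t reproduced here; the following sketch shows the `iterrows` approach under the same assumptions as above (column names, product labels, and prices are made up):

```python
import time

import numpy as np
import pandas as pd

# Recreate the assumed dataset and unit prices
np.random.seed(42)
n = 100_000
df = pd.DataFrame({
    'Year': np.random.randint(2010, 2021, size=n),
    'Month': np.random.randint(1, 13, size=n),
    'Day': np.random.randint(1, 29, size=n),
    'Product': np.random.choice(['A', 'B', 'C'], size=n),
    'Sold': np.random.randint(0, 100, size=n),
})
unit_prices = {'A': 19.99, 'B': 24.99, 'C': 29.99}

start = time.perf_counter()
results = []
for _, row in df.iterrows():
    # Merge the three date columns into a single variable
    date = f"{row['Year']}-{row['Month']:02d}-{row['Day']:02d}"
    # Profit = unit price * amount sold
    profit = unit_prices[row['Product']] * row['Sold']
    # Append as a key-value pair
    results.append({date: profit})
print(f"iterrows: {time.perf_counter() - start:.3f} s")
```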
So, we’ve completed all of the tasks described in the previous sections, and it took almost 8 seconds to complete.
You might think that it isn’t that bad, but just wait until you see what we’ll do next.
The go-to solution
If you think that `iterrows` is good, wait until you meet `itertuples`. It’s a similar function, used to iterate through DataFrame rows, but it does so much faster.
We’ll perform the same task and compare the execution times. `itertuples` works with, well, tuples, so we won’t be able to access the DataFrame values using bracket notation; we use attribute access instead. That’s the only difference.
Anyway, here’s the code:
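Again, the original code isn’t shown; here is a sketch of the same task with `itertuples`, under the same made-up setup. Note the attribute access (`row.Year`) in place of bracket notation:

```python
import time

import numpy as np
import pandas as pd

# Recreate the assumed dataset and unit prices
np.random.seed(42)
n = 100_000
df = pd.DataFrame({
    'Year': np.random.randint(2010, 2021, size=n),
    'Month': np.random.randint(1, 13, size=n),
    'Day': np.random.randint(1, 29, size=n),
    'Product': np.random.choice(['A', 'B', 'C'], size=n),
    'Sold': np.random.randint(0, 100, size=n),
})
unit_prices = {'A': 19.99, 'B': 24.99, 'C': 29.99}

start = time.perf_counter()
results = []
for row in df.itertuples():
    # Each row is a namedtuple, so columns are attributes
    date = f"{row.Year}-{row.Month:02d}-{row.Day:02d}"
    profit = unit_prices[row.Product] * row.Sold
    results.append({date: profit})
print(f"itertuples: {time.perf_counter() - start:.3f} s")
```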
This took only 0.218 seconds to complete! That is roughly a 35x reduction in execution time, which is quite significant. Maybe not so much in this toy example, since 8 seconds isn’t a long time to wait, but the savings scale easily to millions or tens of millions of rows.
Remember: always use `itertuples` when performing tasks similar to this one.
Before you go
This is the second time I’ve covered this topic, but this time the dataset and the problem are much heavier, both for the computer and for the developer to code, so I found it worth sharing.
Don’t take this 35x speed improvement as guaranteed; the results will vary depending on the type of task and your hardware. Nevertheless, `itertuples` should outperform `iterrows` every time. That’s what matters.