This article was originally published on Towards Data Science on January 31st, 2020.
For some time now I’ve been questioning Python’s ability to do stuff fast. Let’s face it, there’s a lot of trash talk about Python’s speed when compared to other languages — like C or Go.
Now, I’ve tried to do data science in Go. It’s possible, but nowhere near as pleasant as in Python, mostly because Go is statically typed and data science is a largely exploratory field. I’m not saying that you can’t gain performance by rewriting a finished solution in Go, but that’s a topic for another article.
What I’ve neglected thus far, to say the least, is Python’s ability to do things faster. I’ve been suffering from tunnel vision: seeing only one solution and neglecting the existence of others. And I’m sure I’m not alone in this.
That’s why today I want to briefly cover how to make everyday Pandas work much faster and more pleasant. To be more precise, the example will focus on iterating through rows and doing some data manipulation in the process. So without further ado, let’s jump into the good stuff.
Let’s make a Dataset
The simplest way to drive the point home is to declare a single-column DataFrame object with integer values ranging from 1 to 100000.
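The original code cell did not survive here, so what follows is a minimal sketch of such a dataset. The variable name df is an assumption, and the column name Values is inferred from the way the column is referred to later in the article.

```python
import pandas as pd

# One column named "Values" holding the integers 1..100000.
# range() excludes its upper bound, hence 100001.
# Both names (df, "Values") are assumptions made for this sketch.
df = pd.DataFrame({"Values": range(1, 100001)})
```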
We really won’t need anything more complex to address Pandas speed issues. To verify that everything went well, check the first couple of rows and the overall shape of the dataset.
Okay, enough with the preparation. Let’s now see how to, and how not to, iterate through the rows of a DataFrame. First, we’ll cover the how-not-to option.
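One way to run that check, sketched under the same assumptions as before (a DataFrame called df with a single Values column), is to call head() and read the shape attribute:

```python
import pandas as pd

# Same assumed dataset: one "Values" column with the integers 1..100000.
df = pd.DataFrame({"Values": range(1, 100001)})

print(df.head())   # first five rows, with values 1 through 5
print(df.shape)    # (rows, columns), here (100000, 1)
```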
Here’s what you should not do
Ah, the method I’ve been guilty of using (and overusing) so much: iterrows(). It’s slow as hell by default, but you know, why should I bother to seek alternatives (tunnel vision).
To prove that you shouldn’t use the iterrows() method to iterate over a DataFrame, I’ll do a quick example: declare a variable, set it to 0 initially, then increment it by the current value of the Values column on each iteration.
In case you’re wondering, the %%time magic command in Jupyter reports how long a cell took to run, as both CPU time and wall-clock time.
Let’s see this in action:
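The cell itself is missing here, so this is a sketch of the loop just described rather than the exact original. It assumes the df and Values names used above, and it swaps %%time for time.perf_counter() so that it also runs outside a notebook. Expect a different timing on your machine.

```python
import time

import pandas as pd

# Assumed dataset: one "Values" column with the integers 1..100000.
df = pd.DataFrame({"Values": range(1, 100001)})

start = time.perf_counter()

total = 0
# iterrows() yields (index, row) pairs; each row is built as a pandas Series,
# which is where most of the overhead comes from.
for _, row in df.iterrows():
    total += row["Values"]

elapsed_iterrows = time.perf_counter() - start
print(f"iterrows(): total={total}, took {elapsed_iterrows:.2f} s")
```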
Now you might be thinking that 15 seconds isn’t that much to go over 100000 rows and increment some outer variable’s value. But it actually is — let’s see why in the next section.
Here’s what you should do
Now a magical method comes to the rescue: itertuples(). As the name suggests, itertuples() loops through the rows of a DataFrame and returns each one as a named tuple. That’s why you won’t be able to access values with bracket notation and a column name, such as row['Values'], but will instead need to use dot notation, such as row.Values.
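To make the access rules concrete, here is a small sketch on a three-row DataFrame (the data and the names df and first are made up for illustration). It shows attribute access, positional access, and the error raised by a string key.

```python
import pandas as pd

# Tiny illustrative DataFrame; the values are arbitrary.
df = pd.DataFrame({"Values": [1, 2, 3]})

# Take the first row that itertuples() yields.
first = next(df.itertuples())

print(first._fields)   # field names: the index first, then each column
print(first.Values)    # dot notation with the column name
print(first[1])        # integer positions also work; position 0 is the index

try:
    first["Values"]    # a string key is not valid on a tuple
except TypeError as err:
    print("TypeError:", err)
```

Integer positions work because a named tuple is still a tuple, but dot notation is usually easier to read.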
I will now demonstrate the same example as I did a couple of minutes earlier, but with itertuples().
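As with the earlier loop, the original cell is not available, so this is a sketch under the same assumptions (df, a Values column, and time.perf_counter() in place of %%time). The only change to the loop is the iteration method and the dot access.

```python
import time

import pandas as pd

# Assumed dataset: one "Values" column with the integers 1..100000.
df = pd.DataFrame({"Values": range(1, 100001)})

start = time.perf_counter()

total = 0
# itertuples() yields one lightweight named tuple per row,
# so no Series object is constructed on each iteration.
for row in df.itertuples():
    total += row.Values

elapsed_itertuples = time.perf_counter() - start
print(f"itertuples(): total={total}, took {elapsed_itertuples:.4f} s")
```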
And voila! To do the same calculation, itertuples() was around 154 times faster! Now imagine your everyday scenario at work, where you’re processing a couple of million rows: itertuples() can save you so much time there.
Before you go
In this trivial example, we’ve seen how tweaking your code just a little bit can have a tremendous impact on the overall result.
It doesn’t mean that itertuples() will be 150 times faster than iterrows() in every scenario, but it does mean it should be faster to some degree practically every time.
Thanks for reading, I hope you’ve liked it.