This article was originally published on Towards Data Science on February 16th, 2020.
It’s been a couple of weeks since Pandas 1.0 was released, and it brought some exciting new features, although some of them are still experimental.
The main idea behind this article is to briefly explore them and to show you how to upgrade to Pandas 1.0, because you’re probably still running version 0.25. So without further ado, let’s see how to do so.
Upgrade from 0.25 to 1.0
If I were to open up Jupyter Notebook and check out the current version of Pandas library, I’d get this:
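A minimal way to check the installed version from a notebook cell or any Python session:

```python
import pandas as pd

# The installed version is exposed as a plain string attribute.
print(pd.__version__)
```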
I’m on version 0.25.1, so to upgrade I’d need to execute the following command (can also be done through Jupyter, no need to open up a terminal or command prompt):
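The usual pip invocation for this is pip install --upgrade pandas (prefixed with an exclamation mark when typed into a Jupyter cell). As a rough sketch, the same thing can be done from Python itself. The helper names upgrade_command and upgrade are my own, made up for illustration, and the dry run is the default so nothing gets installed by accident:

```python
import subprocess
import sys


def upgrade_command(package="pandas"):
    """Build the pip command that upgrades `package` for the running interpreter."""
    return [sys.executable, "-m", "pip", "install", "--upgrade", package]


def upgrade(package="pandas", dry_run=True):
    """Return the command; actually run it only when dry_run is False."""
    command = upgrade_command(package)
    if not dry_run:
        subprocess.check_call(command)
    return command


# Dry run by default: nothing is installed, the command is only printed.
print(" ".join(upgrade()))
```

Calling upgrade(dry_run=False) would run the upgrade for real. Using sys.executable makes sure pip targets the same interpreter the notebook is running on.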
If you get permission errors, run the command line as an administrator, or prefix the command with sudo if you’re on Linux. Now let’s see if everything went as expected:
Keep in mind that you might not see ‘1.0.1’; depending on when you’re reading this article, there may be an even newer version. Now that that’s done, let’s briefly discuss the dataset I’ll use for the demonstrations and where to get it.
I’ve decided to use the MT Cars dataset, a dataset familiar to any R user and great for some basic data manipulation. It can be downloaded from this link, or you can just copy the link and read it directly through Pandas:
Okay, now that that’s out of the way, I can begin with the article!
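Here is a small sketch of the loading step. Since the link itself isn’t reproduced here, I’m using a few illustrative rows held in a string as a stand-in for the file; the column names are assumptions and not the full dataset’s schema:

```python
import io

import pandas as pd

# Stand-in for the downloaded file: a few illustrative rows only.
# The column names are assumptions, not the full dataset's schema.
csv_text = """manufacturer,mpg,cyl,hp
Mazda RX4,21.0,6,110
Datsun 710,22.8,4,93
Hornet 4 Drive,21.4,6,110
Duster 360,14.3,8,245
"""

# pd.read_csv accepts a file path, a URL string, or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```

With the real dataset you would pass the URL string to pd.read_csv instead of the StringIO object.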
1. Markdown Conversion
Now you can easily produce markdown tables as a result of a Pandas DataFrame operation. You will need one additional library though, called tabulate.
To demonstrate, I decided to keep only the cars with 6 cylinders, and only the
It’s now easy to export this result to a markdown table. Simply add .to_markdown() at the end of a statement, and make sure to surround everything with a print() call.
Now you can copy this text and paste it in a markdown cell (select a cell, and then
And voila! The cell now contains a markdown table from your data:
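To make the steps above concrete, here is a minimal sketch. The rows are illustrative, and the two columns I keep (manufacturer and mpg) are an assumption on my part. The try/except is only there so the example doesn’t crash if tabulate isn’t installed:

```python
import pandas as pd

# Illustrative rows; the column choice is an assumption.
df = pd.DataFrame({
    "manufacturer": ["Mazda RX4", "Datsun 710", "Hornet 4 Drive", "Duster 360"],
    "mpg": [21.0, 22.8, 21.4, 14.3],
    "cyl": [6, 4, 6, 8],
})

# Keep only the 6-cylinder cars and two of the columns.
six_cyl = df[df["cyl"] == 6][["manufacturer", "mpg"]]

try:
    # to_markdown() returns a string, so print() gives copyable text.
    markdown = six_cyl.to_markdown()
except ImportError:
    markdown = None
    print("tabulate is missing; install it with: pip install tabulate")
else:
    print(markdown)
```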
Pandas finally got a scalar for representing missing values. The idea is to have a single missing-value marker that behaves consistently across all data types. Until now we had:
- np.nan for floats
- None for objects
- pd.NaT for dates and times
Now we can represent missing values with pd.NA instead of those mentioned above. It’s worth mentioning that it is still an experimental feature and can change its behavior without warning, so I wouldn’t include it in production code just yet.
Nevertheless, that doesn’t mean we can’t play around with it. To quickly see it in action, just execute pd.NA in any Jupyter cell:
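Outside of a notebook, a quick sketch like this shows the same thing, plus how pd.NA behaves in a comparison:

```python
import pandas as pd

# pd.NA is a singleton that displays as <NA>.
print(pd.NA)

# It is recognised by the usual missing-value check.
print(pd.isna(pd.NA))

# Comparisons propagate the missing value instead of returning True/False.
print(pd.NA == 1)
```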
To see how it works on a basic example, I will create a series of a couple of elements and declare one of them as missing, first with None:
You can see how None was ‘translated’ to NA without any issues. You can also directly specify missing values via pd.NA:
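Here is a small sketch of both variants. The nullable Int64 dtype is my choice for the illustration; without an explicit nullable dtype, a list like [1, None, 3] would become a float64 series holding NaN instead:

```python
import pandas as pd

# None is converted to <NA> when a nullable dtype is requested.
from_none = pd.Series([1, None, 3], dtype="Int64")
print(from_none)

# The missing value can also be written as pd.NA directly.
from_na = pd.Series([1, pd.NA, 3], dtype="Int64")
print(from_na)
```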
Now from here, you can use your standard methods to see if a Series contains missing values:
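For example, the familiar isna(), dropna() and fillna() methods work as usual (again a sketch on a nullable Int64 series, which is my choice for the illustration):

```python
import pandas as pd

s = pd.Series([1, pd.NA, 3], dtype="Int64")

# Boolean mask marking the missing entries.
print(s.isna())

# Number of missing values in the series.
print(s.isna().sum())

# Drop or replace the missing entries.
print(s.dropna())
print(s.fillna(0))
```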
And that’s pretty much it for now. Once again, the feature is experimental, so I don’t advise using it in production code.
String Data type
Until now we only had the object datatype to deal with anything non-numeric, which could be problematic for several reasons:
- You could store both strings and non-strings in a single object column
- There was no clear way to extract only string columns, since the object datatype can hold the mixed content described in the first point
As with pd.NA, the new string datatype is still considered experimental, meaning it is prone to change without warning.
Let’s take a look back at our MT Cars dataset, and see which datatypes we have stored in it:
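A sketch of that check on a few illustrative rows. I’m setting dtype="object" explicitly on the text column so the output matches what Pandas 1.0 infers; newer releases may infer a dedicated string dtype on their own:

```python
import pandas as pd

df = pd.DataFrame({
    # Explicit object dtype mirrors what Pandas 1.0 infers for text.
    "manufacturer": pd.Series(["Mazda RX4", "Datsun 710", "Valiant"], dtype="object"),
    "mpg": [21.0, 22.8, 18.1],
    "cyl": [6, 4, 6],
})

# One dtype per column.
print(df.dtypes)
```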
As we can see, manufacturer, which is clearly a string, is declared as object. It might not sound like a big deal, but imagine a scenario where you want to subset your dataset, keeping only the columns of string datatype. You could easily end up including a numerical column that contains a single non-numerical value. And that’s a problem we want to avoid.
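To illustrate the problem, here is a sketch with a hypothetical ‘dirty’ hp column, where one numeric value was entered as text. Selecting the object columns picks it up right next to the genuine string column:

```python
import pandas as pd

df = pd.DataFrame({
    "manufacturer": pd.Series(["Mazda RX4", "Datsun 710", "Valiant"], dtype="object"),
    # Hypothetical dirty column: numbers plus one stray text value.
    "hp": pd.Series([110, 93, "unknown"], dtype="object"),
    "mpg": [21.0, 22.8, 18.1],
})

# Both columns are object, so both are selected.
object_cols = df.select_dtypes(include="object").columns.tolist()
print(object_cols)
```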
Luckily, we can use the .convert_dtypes() method to convert potential objects to strings:
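A minimal sketch on illustrative rows. Note that convert_dtypes() returns a new DataFrame and also moves other columns to the nullable datatypes where it can, so the exact output may differ between Pandas versions:

```python
import pandas as pd

df = pd.DataFrame({
    "manufacturer": pd.Series(["Mazda RX4", "Datsun 710", "Valiant"], dtype="object"),
    "mpg": [21.0, 22.8, 18.1],
    "cyl": [6, 4, 6],
})

# Returns a converted copy; the original DataFrame is left untouched.
converted = df.convert_dtypes()
print(converted.dtypes)
```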
Now I can be more confident when making a subset based on the column datatype:
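As a sketch, reusing the hypothetical dirty hp column from the earlier scenario: after the conversion, selecting on the string datatype returns only the genuine text column, while the mixed column stays object and is left out:

```python
import pandas as pd

df = pd.DataFrame({
    "manufacturer": pd.Series(["Mazda RX4", "Datsun 710", "Valiant"], dtype="object"),
    # Hypothetical dirty column: numbers plus one stray text value.
    "hp": pd.Series([110, 93, "unknown"], dtype="object"),
    "mpg": [21.0, 22.8, 18.1],
}).convert_dtypes()

# Only columns with the dedicated string dtype are selected.
string_cols = df.select_dtypes(include="string").columns.tolist()
print(string_cols)
```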
Before you go
In the end, I just hope you’ve managed to get something useful from the article. None of the new features are groundbreaking, but that doesn’t mean they can’t help you with writing better, more uniform code.
I’m eager to see which features the developer team behind Pandas will add down the road, so stay tuned if you want to find out about them as soon as they are released.
Thanks for reading.