Data Preparation

Top 3 New Features in Pandas 1.0

This article was originally published on Towards Data Science on February 16th, 2020.

It’s been a couple of weeks now since Pandas version 1.0 was released, and it brought some exciting new features, though some of them are still experimental.

The main idea behind this article is to briefly explore them, and also to show you how to upgrade to Pandas 1.0, because you’re probably still running version 0.25. So without further ado, let’s see how to do so.

Upgrade from 0.25 to 1.0

If I were to open up Jupyter Notebook and check out the current version of Pandas library, I’d get this:

I’m on version 0.25.1, so to upgrade I’d need to execute the following command (can also be done through Jupyter, no need to open up a terminal or command prompt):
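The upgrade itself is a single pip command (from a Jupyter cell, prefix it with `!`):

```shell
pip install pandas --upgrade
```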

If you, however, get some permission errors, run the command line as an administrator, or prefix the command with sudo if you’re on Linux. Now let’s see if everything went as expected:
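Checking again after the upgrade (restart the kernel first so the new version gets picked up):

```python
import pandas as pd

# Should now report 1.0.x or newer
print(pd.__version__)
```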

Keep in mind that you might not see ‘1.0.1’; depending on when you’re reading the article, it might be an even newer version. Now that that’s done, let’s briefly discuss the dataset I’ll use for the demonstrations, and also where to get it.

Dataset used

I’ve decided to use the MT Cars dataset, a dataset familiar to any R user and great for some basic data manipulation. It can be downloaded from this link, or you can just copy the link and read it directly through Pandas:
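With the real link you’d just pass the URL to `pd.read_csv()`. Since the link isn’t reproduced here, the snippet below builds a small hypothetical stand-in with the columns used later (`manufacturer`, `mpg`, `cyl`):

```python
import pandas as pd

# With the real link: df = pd.read_csv("<dataset-url>")
# Hypothetical stand-in with the columns used in the examples below:
df = pd.DataFrame({
    "manufacturer": ["Mazda RX4", "Hornet 4 Drive", "Valiant", "Datsun 710"],
    "mpg": [21.0, 21.4, 18.1, 22.8],
    "cyl": [6, 6, 6, 4],
})
print(df.head())
```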

Okay, now that that’s out of the way, I can begin with the article!

1. Markdown Conversion

Now you can easily produce markdown tables as a result of a Pandas DataFrame operation. You will need one additional library though, called tabulate:
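It installs with pip like any other package:

```shell
pip install tabulate
```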

To demonstrate, I decided to keep only the cars with 6 cylinders, and only the manufacturer and mpg columns:
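The filtering step looks like this (using a small hypothetical stand-in for the MT Cars data, since the original snippet isn’t reproduced here):

```python
import pandas as pd

# Hypothetical stand-in for the MT Cars dataset
df = pd.DataFrame({
    "manufacturer": ["Mazda RX4", "Hornet 4 Drive", "Valiant", "Datsun 710"],
    "mpg": [21.0, 21.4, 18.1, 22.8],
    "cyl": [6, 6, 6, 4],
})

# Keep only 6-cylinder cars, and only the manufacturer and mpg columns
six_cyl = df[df["cyl"] == 6][["manufacturer", "mpg"]]
print(six_cyl)
```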

It’s now easy to export this result as a markdown table. Simply add .to_markdown() at the end of the statement, and make sure to surround everything with a print() statement:

Now you can copy this text and paste it into a markdown cell (select the cell, then press Esc, M, Enter):

And voila! The cell now contains a markdown table from your data:

2. NA Scalar

Pandas finally got a scalar for representing missing values. The idea behind it is to have a single scalar for missing values that is consistent across all data types. Until now we had:

  • np.nan — for floats
  • None — for objects
  • pd.NaT — for date and time

Now we can represent missing values with pd.NA instead of those mentioned above. It’s worth mentioning that it is still an experimental feature and can change its behavior without prior warning, so I wouldn’t include it in production code just yet.

Nevertheless, that doesn’t mean we can’t play around with it. To quickly see it in action, just execute pd.NA in any Jupyter cell:
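Outside Jupyter, printing its repr shows the same thing:

```python
import pandas as pd

pd.NA  # in a Jupyter cell this displays as <NA>
print(repr(pd.NA))
```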

To see how it works on a basic example, I will create a series of a couple of elements and will declare one of them as missing, firstly with the None keyword:
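A sketch of that, assuming one of the new nullable dtypes (a plain default-dtype Series would still show None as NaN, so "Int64" is used here):

```python
import pandas as pd

# None in a nullable-integer Series becomes <NA>
s = pd.Series([1, 2, 3, None], dtype="Int64")
print(s)  # the last element displays as <NA>
```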

You can see how None was ‘translated’ to NA without any issues. You can also directly specify missing values via pd.NA:
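Declaring the missing value with pd.NA directly looks like this (same hypothetical nullable-integer Series as above):

```python
import pandas as pd

s = pd.Series([1, 2, pd.NA], dtype="Int64")
print(s)
```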

Now from here, you can use your standard methods to see if a Series contains missing values:
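For example, the usual missing-value checks all behave as expected:

```python
import pandas as pd

s = pd.Series([1, 2, pd.NA], dtype="Int64")

print(s.isna())        # element-wise mask of missing values
print(s.isna().any())  # True, at least one value is missing
print(s.hasnans)       # True
```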

And that’s pretty much it for now. Once again, the feature is experimental, so I don’t advise using it in production code.

3. String Data Type

Until now we only had object datatype to deal with anything not numeric, and it could be problematic for several reasons:

  1. You could store both strings and non-strings in a single object datatype array
  2. There was no clear way to extract only string columns, since .select_dtypes() with the object datatype can run into the first problem

Just like pd.NA this is still considered experimental, meaning it is prone to change without warning.

Let’s take a look back at our MT Cars dataset, and see which datatypes we have stored in it:
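Checking the dtypes on a stand-in version of the dataset (hypothetical, since the original snippet isn’t reproduced here):

```python
import pandas as pd

df = pd.DataFrame({
    "manufacturer": ["Mazda RX4", "Hornet 4 Drive", "Valiant"],
    "mpg": [21.0, 21.4, 18.1],
    "cyl": [6, 6, 6],
})

print(df.dtypes)  # manufacturer shows up as object, not string
```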

As we can see, manufacturer, which is clearly a string column, is stored as object. It might not sound like a big deal, but imagine a scenario where you want to subset your dataset, keeping only columns of string datatype. There you could easily pick up a numerical column that contains a single non-numerical value. And that’s a problem we want to avoid.

Luckily, we can use .convert_dtypes() method to convert potential objects to strings:
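A sketch of the conversion on the same stand-in DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "manufacturer": ["Mazda RX4", "Hornet 4 Drive", "Valiant"],
    "mpg": [21.0, 21.4, 18.1],
})

# .convert_dtypes() infers the best nullable dtype per column;
# object columns holding text become the new "string" dtype
converted = df.convert_dtypes()
print(converted.dtypes)
```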

Now I can be more secure when making a subset based on the column datatype:
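After the conversion, .select_dtypes() can target the string dtype explicitly (again on a hypothetical stand-in):

```python
import pandas as pd

df = pd.DataFrame({
    "manufacturer": ["Mazda RX4", "Hornet 4 Drive"],
    "mpg": [21.0, 21.4],
}).convert_dtypes()

# Only genuinely textual columns survive this subset
strings_only = df.select_dtypes(include="string")
print(strings_only.columns.tolist())
```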

Before you go

In the end, I just hope you’ve managed to get something useful from the article. None of the new features are groundbreaking, but that doesn’t mean they can’t help you with writing better, more uniform code.

I’m eager to see which features the developer team behind Pandas will add down the road, so stay tuned if you want to find out about them as soon as they are released.

Thanks for reading.

Dario Radečić
Data scientist, blogger, and enthusiast. Passionate about deep learning, computer vision, and data-driven decision making.
