This article was originally published on Towards Data Science on February 24th, 2020.
When it comes to data science or data analysis, Python is pretty much always the language of choice. Its Pandas library is one you cannot, and more importantly, shouldn’t, avoid.
While Pandas by itself isn’t that difficult to learn, mainly due to the self-explanatory method names, having a cheat sheet is still worthwhile, especially if you want to code something out quickly. That’s why today I want to put the focus on how I use Pandas to do Exploratory Data Analysis, by providing you with the list of my most used methods and also a detailed explanation of each.
I don’t want to dwell too long on the intro, so I’ll just quickly go over the dataset used and then we’ll jump into the good stuff.
There’s no need to use complex datasets to demonstrate simple ideas, so with that in mind, I decided to use the Iris dataset. If you decide to follow along with the code you can find the dataset on this link.
Here’s how to import the Pandas library and load in the dataset:
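The original post’s download link didn’t survive in this copy, so the sketch below loads the same Iris data through scikit-learn instead and renames the columns to match the common CSV version of the dataset (the source and the column names are my assumption, not the article’s original code):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load Iris through scikit-learn as a stand-in for the article's CSV link
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={
    'sepal length (cm)': 'sepal_length',
    'sepal width (cm)': 'sepal_width',
    'petal length (cm)': 'petal_length',
    'petal width (cm)': 'petal_width',
})
# Replace the numeric target with the actual species names
df['species'] = df.pop('target').map(dict(enumerate(iris.target_names)))

print(df.shape)  # (150, 5)
```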
Okay, let’s not waste any more time and see what the EDA process typically looks like, Pandas-wise.
1. head(), tail(), and sample()
I decided to put these 3 into the same bucket because the main idea behind them is the same, and that’s to see what the data looks like without visualizing it.
Without any kind of doubt, I can say that these 3 methods are among the first ones I use, with the focus on the last one, because both head() and tail() might be misleading if you’re dealing with sequential data.
Anyhow, here’s how to use them on our dataset:
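A minimal sketch of the three calls, with the dataset reloaded through scikit-learn so the snippet runs on its own (the loading shortcut is my assumption, not the article’s original code):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Reload Iris as in the import step (scikit-learn as a stand-in source)
df = load_iris(as_frame=True).frame.rename(
    columns=lambda c: c.replace(' (cm)', '').replace(' ', '_'))

print(df.head())     # the first 5 rows
print(df.tail())     # the last 5 rows
print(df.sample(5))  # 5 randomly chosen rows
```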
With the last one, there’s no guarantee you’ll get the same results every time, due to the random sampling, and there’s nothing wrong with that.
There’s not much more to say about these three methods, use them to get a first look at your data, but for nothing more.
2. describe()
If there’s one thing you do over and over again in the process of exploratory data analysis, it’s performing a statistical summary for every (or almost every) attribute.
It would be quite a tedious process without the right tools, but thankfully Pandas is here to do the heavy lifting for you. The describe() method will produce a quick statistical summary for every numerical column, as shown below:
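A sketch of the call, assuming the df loaded earlier (reloaded here through scikit-learn so the snippet is self-contained):

```python
import pandas as pd
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame.rename(
    columns=lambda c: c.replace(' (cm)', '').replace(' ', '_'))

# count, mean, std, min, quartiles, and max for every numerical column
print(df.describe())
```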
As the result is just another Pandas DataFrame, no one is forcing you to keep every statistical value. Here’s how you’d go about keeping only the mean, median, and standard deviation:
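One way to do the filtering, keeping in mind that describe() labels the median as 50% (the dataset is reloaded here so the snippet runs on its own):

```python
import pandas as pd
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame.rename(
    columns=lambda c: c.replace(' (cm)', '').replace(' ', '_'))

# Transpose so attributes become rows, then keep only three statistics
stats = df.describe().T[['mean', '50%', 'std']]
print(stats)
```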
Note how I’m using the transpose operator to switch from columns to rows, and vice-versa.
3. nsmallest() and nlargest()
I’m guessing there’s no doubt about the purpose of these two methods after just reading their names, but nevertheless, they can prove valuable in the process of exploratory data analysis.
I use them often after conducting a statistical summary of the dataset, to check whether some attribute of interest contains extremes or, on the other hand, is fairly static.
Let’s see how we’d go about finding the 5 observations with the smallest value of a given attribute:
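The attribute’s name didn’t survive in this copy of the article, so sepal_length below is purely an illustrative choice (and the dataset is reloaded so the snippet is self-contained):

```python
import pandas as pd
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame.rename(
    columns=lambda c: c.replace(' (cm)', '').replace(' ', '_'))

# 5 rows with the smallest sepal_length (illustrative column choice)
print(df.nsmallest(5, 'sepal_length'))
```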
And on the other end of the spectrum, here’s how to find the 5 observations with the largest value:
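The mirror image of the previous call, again with sepal_length as an illustrative column choice:

```python
import pandas as pd
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame.rename(
    columns=lambda c: c.replace(' (cm)', '').replace(' ', '_'))

# 5 rows with the largest sepal_length (illustrative column choice)
print(df.nlargest(5, 'sepal_length'))
```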
And that’s pretty much it — nothing more to say here.
4. isnull() and sum()
Missing values are a big part of data analysis; there’s no doubt about it. It’s very unlikely that every observation will be available at every point in time, and the reasons can be many, from domain-specific causes to simple human error.
A part of every exploratory data analysis process is, therefore, testing for missing values and figuring out how to deal with them.
The Pandas library has you covered here as well. We can check for missing values on the entire DataFrame:
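A sketch of the check, with the dataset reloaded so the snippet runs on its own:

```python
import pandas as pd
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame.rename(
    columns=lambda c: c.replace(' (cm)', '').replace(' ', '_'))

# A DataFrame of booleans: True wherever a value is missing
print(df.isnull())
```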
Now, the Iris dataset is completely free of missing values, but that doesn’t mean we can’t explore further.
Because Python evaluates True as 1 and False as 0, we can use the sum() function to get some concrete numbers instead of the DataFrame of booleans from above:
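A minimal sketch of the per-column count (dataset reloaded so the snippet is self-contained):

```python
import pandas as pd
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame.rename(
    columns=lambda c: c.replace(' (cm)', '').replace(' ', '_'))

# Summing booleans column-wise gives the missing-value count per column
print(df.isnull().sum())
```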
Now, this is much more pleasant to look at. There’s nothing stopping us from using the sum() function again if we want the total number of missing values in the entire dataset, rather than in a particular column:
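Chaining the two calls collapses everything into a single number (dataset reloaded so the snippet runs on its own):

```python
import pandas as pd
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame.rename(
    columns=lambda c: c.replace(' (cm)', '').replace(' ', '_'))

# First sum() counts per column, the second adds those counts up
total_missing = df.isnull().sum().sum()
print(total_missing)  # 0 for the Iris dataset
```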
5. value_counts()
If you’re looking for a method that can tell you how many observations there are for each possible value of an attribute, look no further. The value_counts() method can do all of that for you, and more, with options to account for missing values or even to display the results as percentages.
Now you wouldn’t use this method on the entire DataFrame, but on a single attribute instead. Here’s an example:
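A sketch on the species attribute, with the dataset reloaded and the numeric target mapped to species names so the snippet is self-contained:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame
# Map the numeric target to species names so we have a categorical column
df['species'] = df['target'].map(dict(enumerate(iris.target_names)))

# Count how many observations fall into each species
print(df['species'].value_counts())  # each of the 3 species appears 50 times
```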
Another important idea reveals itself here: it’s best to use this method on a categorical attribute, at least if you care about concise output. If you’re not that lucky, you can refer to methods like cut() for transforming continuous variables into categorical ones.
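For example, a continuous attribute can be binned into a handful of labeled categories with pd.cut(); the bin count and labels below are arbitrary choices of mine:

```python
import pandas as pd
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame.rename(
    columns=lambda c: c.replace(' (cm)', '').replace(' ', '_'))

# Split sepal_length into 3 equal-width bins with made-up labels
binned = pd.cut(df['sepal_length'], bins=3, labels=['short', 'medium', 'long'])
print(binned.value_counts())
```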
If your attribute contains missing values, value_counts() won’t account for them automatically; you need to specify that:
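Iris has no missing values, so this sketch injects one artificially to show the dropna=False flag in action:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame
df['species'] = df['target'].map(dict(enumerate(iris.target_names)))

species = df['species'].copy()
species.iloc[0] = None  # artificially introduce a missing value

# Without dropna=False the NaN row would be silently omitted
print(species.value_counts(dropna=False))
```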
Similarly, sometimes it’s nice to get percentages returned instead of plain integers, as that makes it easier to present results to someone:
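The normalize=True flag returns fractions, which you can scale to percentages (dataset reloaded so the snippet runs on its own):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame
df['species'] = df['target'].map(dict(enumerate(iris.target_names)))

# normalize=True returns fractions; multiply by 100 for percentages
percentages = df['species'].value_counts(normalize=True) * 100
print(percentages)
```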
Before you go
I hope this article will point you in the right direction when it comes to Python, Pandas, and Exploratory Data Analysis in general.
Personally, I don’t use many more methods in real projects, but the trick is to do proper data manipulation and transformation beforehand. Analysis itself then becomes trivial.
Thanks for reading.