
Principal Component Analysis (PCA) from scratch in Python

This article was originally published on Towards Data Science on June 20th, 2020.

Principal Component Analysis is a mathematical technique used for dimensionality reduction. Its goal is to reduce the number of features whilst keeping most of the original information. Today we’ll implement it from scratch, using pure Numpy.

If you’re wondering why PCA is useful for your average machine learning task, here are the top three benefits:

  • Reduces training time — due to smaller dataset
  • Removes noise — by keeping only what’s relevant
  • Makes visualization possible — in cases where you have a maximum of 3 principal components

The last one is a biggie — and we’ll see it in action today.

But why is it a biggie? Good question. Imagine that you have a dataset of 10 features and want to visualize it. But how? 10 features = 10 physical dimensions. We as humans kind of suck when it comes to visualizing anything above 3 dimensions — hence the need for dimensionality reduction techniques.

I want to make one important note here: principal component analysis is not a feature selection algorithm. It won’t give you the top N features the way, for example, forward selection would. Instead, it will give you N principal components, where N equals the number of original features.

If that sounds confusing, I strongly recommend watching the video embedded in the original article; it dives deep into the theoretical reasoning and explains everything much better than I’m capable of.

Our agenda for today is as follows:

  • Load the dataset
  • Perform PCA
  • Make awesome visualizations

So without further ado, let’s dive in.


Dataset and imports

I want everything to be super simple here, so I’ve decided to go with the well-known Iris dataset. It initially has only 4 features — still impossible to visualize. We’ll address this visualization issue after applying PCA.

Here are the imports and dataset loading:

import numpy as np
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
df.head()

Executing the code above loads the dataset and shows the first five rows: four flower measurement features plus the species column.

Let’s continue with the PCA itself.


PCA step by step

Here is the short summary of the required steps:

  1. Scale the data — we don’t want some feature to be voted as “more important” due to scale differences. 10m = 10000mm, but the algorithm isn’t aware of meters and millimeters (sorry US readers)
  2. Calculate covariance matrix — square matrix giving the covariances between each pair of elements of a random vector
  3. Eigendecomposition — we’ll get to that

So let’s start with the first (and easiest) one.

Data scaling

I’ve briefly touched on the idea of why we need to scale the data, so I won’t repeat myself here. Think of it as a necessary prerequisite — not only here, but for any machine learning task.

To perform the scaling we’ll use the StandardScaler from Scikit-Learn. We’ll also separate the features from the target column (species) here, since only the features should be scaled:

from sklearn.preprocessing import StandardScaler

X = df.drop(columns='species')  # features only
y = df['species']               # target labels, kept for the visualizations later

X_scaled = StandardScaler().fit_transform(X)
X_scaled[:5]
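
By the way, StandardScaler isn’t doing anything magical. If you’re curious, here’s a minimal sketch of the same standardization done directly in NumPy, using the X defined above:

# Subtract each feature's mean and divide by its standard deviation
# (StandardScaler uses the population std, which is also NumPy's default)
X_manual = (X.values - X.values.mean(axis=0)) / X.values.std(axis=0)

# Should match the StandardScaler output
np.allclose(X_manual, X_scaled)  # True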

And that does it for this part. Let’s proceed.

Covariance matrix

Let’s take a step back here and understand the difference between variance and covariance. Variance measures the variation of a single random variable (say, the weight of a person), whereas covariance measures how much two random variables vary together (like the weight and height of a person).

On the diagonal of the covariance matrix we have variances, and other elements are the covariances.
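
To make this concrete, here’s a tiny made-up example with two variables, weight and height (the numbers are arbitrary), showing where the variances and the covariance end up in the matrix:

# Made-up weights (kg) and heights (cm) for five people
weight = np.array([60, 72, 80, 55, 90])
height = np.array([165, 178, 182, 160, 190])

toy_cov = np.cov(weight, height)
print(toy_cov)
# toy_cov[0, 0] -> variance of weight
# toy_cov[1, 1] -> variance of height
# toy_cov[0, 1] == toy_cov[1, 0] -> covariance of weight and height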

Let’s not dive into the math here as you have the video for that part. Here’s how to obtain the covariance matrix in Numpy:

# np.cov expects each row to represent a feature, hence the transpose
features = X_scaled.T
cov_matrix = np.cov(features)
cov_matrix[:5]

Cool. As you can see, the diagonal elements are all identical (a direct consequence of standardizing the data beforehand), and the matrix is symmetric. Up next, eigendecomposition.

Eigendecomposition

Eigendecomposition is a process that decomposes a square matrix into eigenvectors and eigenvalues. Eigenvectors are simple unit vectors, and eigenvalues are coefficients which give the magnitude to the eigenvectors.
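
Before applying it to our covariance matrix, here’s a quick sanity check on a small made-up symmetric matrix: for every eigenpair, multiplying the matrix by the eigenvector gives the same result as scaling that eigenvector by its eigenvalue.

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eig_vals, eig_vecs = np.linalg.eig(A)

for i in range(len(eig_vals)):
    # A @ v equals lambda * v for every eigenpair
    print(np.allclose(A @ eig_vecs[:, i], eig_vals[i] * eig_vecs[:, i]))  # True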

We know so far that our covariance matrix is symmetric. As it turns out, the eigenvectors of a symmetric matrix are orthogonal. For PCA, this means the first principal component explains the largest share of the variance. Orthogonal to it is the second principal component, which explains most of the remaining variance. This repeats for N principal components, where N equals the number of original features.

And this turns out to be neat for us: principal components are ranked by the percentage of variance they explain, so we can decide how many to keep. For example, if we originally have 100 features but the first 3 principal components explain 95% of the variance, it makes sense to keep only those 3 for visualization and model training.

As this isn’t a math lecture on eigendecomposition, I think it’s time to do some practical work next. Feel free to explore the theoretical part on your own.

We can perform the eigendecomposition through Numpy, and it returns a tuple, where the first element represents eigenvalues and the second one represents eigenvectors:

# Column i of `vectors` is the eigenvector paired with eigenvalue values[i]
values, vectors = np.linalg.eig(cov_matrix)
values[:5]
vectors[:5]
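
One caveat worth knowing: np.linalg.eig makes no guarantee about the order of the eigenvalues it returns. For this particular matrix they happen to come out from largest to smallest, which is what the rest of the code relies on, but if you want to be safe you can sort the eigenpairs yourself and also verify that the eigenvectors are orthonormal:

# Sort the eigenpairs by eigenvalue, largest first
order = np.argsort(values)[::-1]
values_sorted = values[order]
vectors_sorted = vectors[:, order]

# The eigenvectors of a symmetric matrix form an orthonormal set,
# so vectors.T @ vectors should be (numerically) the identity matrix
np.allclose(vectors.T @ vectors, np.eye(vectors.shape[0]))  # True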

Just from this, we can calculate the percentage of explained variance per principal component:

explained_variances = []
for i in range(len(values)):
    explained_variances.append(values[i] / np.sum(values))

print(np.sum(explained_variances), '\n', explained_variances)

The first value is just the sum of explained variances — and must be equal to 1. The second value is an array, representing the explained variance percentage per principal component.

The first two principal components account for around 96% of the variance in the data. Cool.
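
To turn the “95% of the variance” rule of thumb from earlier into code, a cumulative sum does the trick. A quick sketch, assuming the explained_variances list computed above (already ordered from the largest component to the smallest, as it is here):

# Cumulative explained variance after keeping 1, 2, 3, ... components
cumulative = np.cumsum(explained_variances)
print(cumulative)  # roughly [0.73, 0.96, 0.99, 1.00] for Iris

# Smallest number of components that explains at least 95% of the variance
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components)  # 2 for Iris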

Let’s now dive into some visualizations where we can see the clear purpose of applying PCA.


Visualizations

Earlier we reached the conclusion that we as humans can’t visualize anything above 3 dimensions. The Iris dataset initially had 4 dimensions (4 features), but after applying PCA we’ve managed to explain most of the variance with only 2 principal components.

Now we’ll create a Pandas DataFrame object consisting of those two components, alongside the target class. Here’s the code:

projected_1 = X_scaled.dot(vectors.T[0])
projected_2 = X_scaled.dot(vectors.T[1])

res = pd.DataFrame(projected_1, columns=['PC1'])
res['PC2'] = projected_2
res['Y'] = y
res.head()
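
As a side note, the two dot products above can be written as a single matrix multiplication of the scaled data with the first two eigenvector columns. A quick equivalence check:

# Project onto the first two principal components in one go
projected = X_scaled @ vectors[:, :2]

# Same numbers as the column-by-column version above
np.allclose(projected[:, 0], projected_1), np.allclose(projected[:, 1], projected_2)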

Okay, and now with the power of Python’s visualization libraries, let’s first visualize this dataset in 1 dimension, as a line. To do so, we’ll need to ditch the second principal component. The easiest way is to hardcode the Y values as zeros, as the scatter plot requires values for both the X and Y axes:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(20, 10))
sns.scatterplot(x=res['PC1'], y=[0] * len(res), hue=res['Y'], s=200)

Just look at how separable the Setosa class is. Virginica and Versicolor are tougher to classify, but we should still get most of the classifications correct only with a single principal component.

Let’s now see how this looks in a 2D space:

plt.figure(figsize=(20, 10))
sns.scatterplot(x=res['PC1'], y=res['PC2'], hue=res['Y'], s=200)

Awesome. For fun, try to include the third principal component and plot a 3D scatter plot.
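
If you want to give the 3D version a shot, here’s one possible sketch: project onto the third eigenvector as well and use Matplotlib’s 3D axes (the PC3 column name is just my choice):

res['PC3'] = X_scaled.dot(vectors.T[2])

fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(projection='3d')

# One scatter call per class so each species gets its own color and a legend entry
for species, group in res.groupby('Y'):
    ax.scatter(group['PC1'], group['PC2'], group['PC3'], s=60, label=species)

ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
ax.legend()
plt.show()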

And that does it for this article. Let’s wrap things up in the next section.


Before you leave

Until now I’ve seen either purely mathematical or purely library-based articles on PCA. It’s easy to do it with Scikit-Learn, but I wanted to take a more manual approach here because there’s a lack of articles online which do so.
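
For reference, here’s a short sketch of the Scikit-Learn route, so you can check that the from-scratch numbers line up with the library’s:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_projected = pca.fit_transform(X_scaled)

# Should match explained_variances[:2] from the manual version (roughly 0.73 and 0.23)
print(pca.explained_variance_ratio_)

# Note: the projected coordinates may differ from ours by a sign flip per component,
# since the direction of an eigenvector is arbitrary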

I hope you’ve managed to follow along and that this “abstract concept” of dimensionality reduction isn’t so abstract anymore.

Thanks for reading.

Dario Radečić
Data scientist, blogger, and enthusiast. Passionate about deep learning, computer vision, and data-driven decision making.
