
Data Scaling for Machine Learning - The Essential Guide

It’s possible that you will come across datasets whose features carry a lot of numerical noise, such as high variance or wildly different scales, so good preprocessing is a must before you even think about machine learning. A common preprocessing step for this type of problem is standardization.

Standardization is a preprocessing method used to rescale continuous data so that every feature has zero mean and unit variance. In scikit-learn this is often a necessary step, because many models are sensitive to the scale of the input features, and if the scales differ wildly, you risk biasing your model.
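Under the hood, standardization is just the z-score: subtract the feature’s mean and divide by its standard deviation. Here’s a minimal NumPy sketch of what that computation looks like (the values are made up for illustration):

import numpy as np

# z-score: subtract the mean, divide by the standard deviation;
# the result has zero mean and unit variance
x = np.array([10.0, 20.0, 30.0, 40.0])
z = (x - x.mean()) / x.std()
print(z.mean(), z.std())  # ~0.0 and 1.0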

You can standardize your data in different ways, and in this article we’re going to focus on the most popular method: data scaling, or standard scaling to be more precise.

It’s also important to note that standardization is a preprocessing method applied to continuous, numerical data, and there are a few different scenarios in which you want to use it:

  1. When working with any kind of model that uses a linear distance metric or operates in a linear space, such as KNN, linear regression, or K-means
  2. When a feature or features in your dataset have high variance: a feature whose variance is an order of magnitude (or more) greater than the other features can dominate and bias the model (a quick way to check is shown after this list)
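For the second scenario, a quick way to eyeball per-feature variances is a pandas one-liner (assuming a DataFrame like the wine one loaded later in this article):

# variances that differ by an order of magnitude or more are a strong
# hint that scaling is needed before fitting a scale-sensitive model
print(wine.var(numeric_only=True).sort_values(ascending=False))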

Let’s now proceed with data scaling.


Data scaling

Scaling is a method of standardization that’s most useful when working with a dataset that contains continuous features on different scales, and you’re using a model that operates in some sort of linear space (like linear regression or K-nearest neighbors).

Feature scaling transforms the features in your dataset so they have a mean of zero and a variance of one. This makes it easier to compare features directly, and it’s effectively a prerequisite for many models in scikit-learn.

Let’s take a look at a dataset called wine:

import pandas as pd
import numpy as np
from sklearn import datasets

# load the wine dataset and convert it to a pandas DataFrame
wine = datasets.load_wine()
wine = pd.DataFrame(
    data=np.c_[wine['data'], wine['target']],
    columns=wine['feature_names'] + ['target']
)

We want to use the ash, alcalinity_of_ash, and magnesium columns of the wine dataset to train a linear model, but it’s possible that these columns are all measured on different scales, which would bias the model. Calling the describe() function returns descriptive statistics about the dataset:

wine[['magnesium', 'ash', 'alcalinity_of_ash']].describe()

We can see that the max of ash is 3.23, the max of alcalinity_of_ash is 30, and the max of magnesium is 162. There are huge differences between these values, and a machine learning model could easily interpret magnesium as the most important attribute purely because of its larger scale.
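To make that concrete, here’s a toy example (with hypothetical values) of how the Euclidean distance that a model like KNN relies on gets dominated by the feature with the largest scale:

import numpy as np

# two hypothetical wine samples: (magnesium, ash, alcalinity_of_ash)
a = np.array([100.0, 2.3, 19.0])
b = np.array([120.0, 2.5, 21.0])

# squared differences: 400 for magnesium vs. 0.04 and 4 for the others,
# so the distance is driven almost entirely by magnesium
print(np.linalg.norm(a - b))  # ~20.10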

Let’s standardize them in a way that makes them suitable for a linear model. Here are the steps:

  1. Import StandardScaler and create an instance of it
  2. Create a subset on which scaling is performed
  3. Apply the scaler to the subset

Here’s the code:

from sklearn.preprocessing import StandardScaler

# create the scaler
ss = StandardScaler()

# take a subset of the dataframe you want to scale
wine_subset = wine[['magnesium', 'ash', 'alcalinity_of_ash']]

# apply the scaler to the dataframe subset
wine_subset_scaled = ss.fit_transform(wine_subset)

Awesome! Let’s see what the first couple of rows of scaled data look like:
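Note that fit_transform returns a plain NumPy array, so one way to inspect the result as a table is to wrap it back into a DataFrame:

# wrap the scaled array back into a DataFrame for readability
wine_subset_scaled = pd.DataFrame(
    wine_subset_scaled, columns=wine_subset.columns
)
print(wine_subset_scaled.head())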

The values are now much closer together. To see how scaling actually impacts the model’s predictive power, let’s make a quick KNN model. 

First, with the non-scaled data:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X = wine.drop('target', axis=1)
y = wine['target']

# note: without a fixed random_state, the split (and the score)
# will vary slightly between runs
X_train, X_test, y_train, y_test = train_test_split(X, y)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))
>>> 0.666666666666

Not a great accuracy. Let’s scale the entire dataset and repeat the process:

ss = StandardScaler()

X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)


print(knn.score(X_test, y_test))
>>> 0.97777777777777777

As you can see, the accuracy of our model increased significantly. I’ll leave further tweaking of this KNN classifier up to you, and who knows, maybe you can get all the classifications correct.
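One caveat: in the scaled example above, the scaler was fit on the whole dataset before the train/test split, so statistics from the test set leak into the scaling. A cleaner pattern is to put the scaler inside a scikit-learn pipeline, so it’s fit on the training data only. Here’s a sketch, assuming X_train and X_test come from splitting the raw, unscaled X:

from sklearn.pipeline import make_pipeline

# the scaler is fit on the training data only inside the pipeline,
# so no test-set statistics leak into the transformation
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))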

Let’s wrap things up in the next section.


Before you go

That’s pretty much it for data standardization and why it’s important. We’ll compare StandardScaler with other scalers some other time. The take-home point of this article is that you should use StandardScaler whenever you need (relatively) normally distributed, comparably scaled features.

To be more precise, use StandardScaler whenever you’re using a model that is sensitive to the scale of its features, such as KNN or linear regression.

Thanks for reading.

