
# Data Scaling for Machine Learning - The Essential Guide

You will often come across datasets in which features have very different variances or are measured on entirely different scales, so solid preprocessing is a must before even thinking about machine learning. A common preprocessing solution for this type of problem is standardization.

Standardization is a preprocessing method that transforms continuous data so it looks normally distributed. In `scikit-learn`, this is often a necessary step, because many models assume that the data you are training on is normally distributed, and if it isn’t, you risk biasing your model.
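
Concretely, standardization is the z-score transformation: subtract the feature’s mean from each value and divide by the feature’s standard deviation. A minimal sketch with NumPy (the column values are made up for illustration):

```
import numpy as np

# a single made-up feature column
x = np.array([10.0, 12.0, 14.0, 18.0, 26.0])

# z-score: subtract the mean, divide by the standard deviation
# (NumPy's default population std matches what StandardScaler uses)
z = (x - x.mean()) / x.std()

print(z.mean())  # approximately 0
print(z.std())   # 1.0
```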

You can standardize your data in different ways, and in this article we’re going to focus on the most popular one: standard scaling.

It’s also important to note that standardization is a preprocessing method applied to continuous, numerical data, and there are a few different scenarios in which you want to use it:

1. When working with any kind of model that uses a linear distance metric or operates in a linear space, such as KNN, linear regression, or K-means
2. When a feature or features in your dataset have high variance, since a feature whose variance is an order of magnitude or more greater than the other features can bias a model that assumes the data is normally distributed (a quick way to check this is shown below)
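
For the second scenario, you can inspect the per-feature variances directly. A minimal sketch, loading the same wine dataset we’ll use throughout this article:

```
import pandas as pd
from sklearn import datasets

wine_data = datasets.load_wine()
df = pd.DataFrame(wine_data['data'], columns=wine_data['feature_names'])

# per-feature variance -- gaps of an order of magnitude or more
# are a strong hint that scaling is needed
print(df.var().sort_values(ascending=False))
```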

Let’s now proceed with data scaling.

### Data scaling

Scaling is a method of standardization that’s most useful when working with a dataset that contains continuous features on different scales, and you’re using a model that operates in some sort of linear space (like linear regression or K-nearest neighbors).

Feature scaling transforms the features in your dataset so they have a mean of zero and a variance of one. This makes it easier to compare features linearly. It’s also a requirement for many models in `scikit-learn`.

Let’s take a look at a dataset called wine:

```
import pandas as pd
import numpy as np
from sklearn import datasets

# load the built-in wine dataset, then turn it into a DataFrame
wine = datasets.load_wine()
wine = pd.DataFrame(
    data=np.c_[wine['data'], wine['target']],
    columns=wine['feature_names'] + ['target']
)
```

We want to use the `ash`, `alcalinity_of_ash`, and `magnesium` columns in the wine dataset to train a linear model, but these columns are measured on different scales, which would bias a linear model. Calling the `describe()` function returns descriptive statistics about the dataset:

`wine[['magnesium', 'ash', 'alcalinity_of_ash']].describe()`

We can see that the max of `ash` is 3.23, the max of `alcalinity_of_ash` is 30, and the max of `magnesium` is 162. There are huge differences between these values, and a machine learning model could easily interpret `magnesium` as the most important attribute simply because of its larger scale.
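
To see why this matters for a distance-based model, compare the squared per-feature differences between two samples: the `magnesium` term dominates the Euclidean distance simply because its raw values are larger. A rough sketch with made-up values on these columns’ scales:

```
import numpy as np

# two hypothetical samples: (ash, alcalinity_of_ash, magnesium)
a = np.array([2.4, 15.0, 100.0])
b = np.array([3.2, 30.0, 160.0])

# squared per-feature differences
print((a - b) ** 2)           # [0.64, 225.0, 3600.0]

# the distance is driven almost entirely by magnesium
print(np.linalg.norm(a - b))  # ~61.85
```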

Let’s standardize them in a way that makes them usable in a linear model. Here are the steps:

1. Import `StandardScaler` and create an instance of it
2. Create a subset of the columns you want to scale
3. Apply the scaler to that subset

Here’s the code:

```
from sklearn.preprocessing import StandardScaler

# create the scaler
ss = StandardScaler()

# take a subset of the dataframe you want to scale
wine_subset = wine[['magnesium', 'ash', 'alcalinity_of_ash']]

# apply the scaler to the dataframe subset
wine_subset_scaled = ss.fit_transform(wine_subset)
```

Awesome! Let’s see what the first couple of rows of the scaled data look like:
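
Since `fit_transform()` returns a plain NumPy array, one quick way to inspect it (a sketch) is to wrap it back into a DataFrame with the original column names:

```
scaled_df = pd.DataFrame(wine_subset_scaled, columns=wine_subset.columns)
print(scaled_df.head())
```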

The values are now much closer together. To see how scaling actually impacts the model’s predictive power, let’s make a quick KNN model.

First, with the non-scaled data:

```
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X = wine.drop('target', axis=1)
y = wine['target']

# note: without a fixed random_state, the split (and the score) varies between runs
X_train, X_test, y_train, y_test = train_test_split(X, y)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))
>>> 0.666666666666
```

That’s not a great accuracy. Let’s scale the entire dataset and repeat the process:

```
ss = StandardScaler()

X_scaled = ss.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))
>>> 0.97777777777777777
```

As you can see, the accuracy of our model increased significantly. I’ll leave further tweaking of this KNN classifier up to you, and who knows, maybe you can get all the classifications correct.
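
If you want a starting point for that tweaking, a grid search over the number of neighbors is the usual first step. A minimal sketch (the parameter range is arbitrary):

```
from sklearn.model_selection import GridSearchCV

# try a handful of neighbor counts on the scaled training data
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': range(1, 16)},
    cv=5
)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))
```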

Let’s wrap things up in the next section.

### Before you go

That’s pretty much it for data standardization and why it is important. We’ll compare `StandardScaler` with other scalers some other time. The take-home point of this article is that you should use `StandardScaler` whenever you need your features to be (approximately) normally distributed.

To be more precise, use `StandardScaler` whenever you’re using a model that assumes the data is normally distributed, such as KNN or linear regression.
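
In practice, the cleanest way to pair the two is scikit-learn’s `Pipeline`, which fits the scaler on the training data only and reapplies the same transformation at prediction time. A minimal sketch, reusing the unscaled `X` and `y` from earlier:

```
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# the pipeline handles scaling internally, so we split the raw features;
# the scaler is fit on the training portion only, avoiding data leakage
X_train, X_test, y_train, y_test = train_test_split(X, y)

model = make_pipeline(StandardScaler(), KNeighborsClassifier())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```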

Join my private email list for more helpful insights.
