Linear regression is the simplest algorithm you’ll encounter while studying machine learning. If we’re talking about simple linear regression, you only need to find values for two parameters – slope and the intercept – but more on that in a bit.
Today you’ll get your hands dirty implementing a simple linear regression algorithm from scratch. This is the first of many upcoming from-scratch articles, so stay tuned to the blog if you want to learn more.
Today’s article is structured as follows:
- Introduction to Simple Linear Regression
- Math Behind Simple Linear Regression
- From-Scratch Implementation
- Comparison with Scikit-Learn
You can download the corresponding notebook here.
Introduction to Simple Linear Regression
As the name suggests, simple linear regression is simple. It’s an algorithm used by many in introductory machine learning, but it doesn’t require any “learning”. It’s as simple as plugging a few values into a formula – more on that in the following section.
In general, linear regression is used to predict continuous variables, such as a stock price or a person’s weight.
Linear regression is a linear algorithm, meaning it assumes a linear relationship between the input variables (what goes in) and the output variable (the prediction). It’s not the end of the world if the relationships in your dataset aren’t linear, as there are plenty of transformation methods that can help.
Several types of linear regression models exist:
- Simple linear regression – has a single input variable and a single output variable. For example, using height to predict the weight.
- Multiple linear regression – has multiple input variables and a single output variable. For example, using height, body fat, and BMI to predict weight.
Today we’ll deal with simple linear regression. The article on multiple linear regression is coming out next week, so stay tuned to the blog if you want to learn more.
Linear regression is rarely used as a go-to algorithm for solving complex machine learning problems. Instead, it’s used as a baseline model – a point which more sophisticated algorithms have to outperform.
The algorithm is also rather strict on the requirements. Let’s list and explain a few:
- Linear Assumption – the model assumes the relationship between variables is linear
- No Noise – the model assumes that the input and output variables are not noisy, so remove outliers if possible
- No Collinearity – the model will overfit when you have highly correlated input variables
- Normal Distribution – the model will make more reliable predictions if your input and output variables are normally distributed. If that’s not the case, try using some transforms on your variables to make them more normal-looking
- Rescaled Inputs – use scalers or normalizers to make more reliable predictions
You now know enough theory behind this simple algorithm. Let’s look at the math next before the implementation.
Math Behind Simple Linear Regression
In essence, simple linear regression boils down to solving a couple of equations. You only need to solve the line equation, displayed in the following figure:
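The figure is not reproduced in this text. Assuming it shows the standard line equation, with Beta 0 as the intercept and Beta 1 as the slope (written here with a lowercase x for the input and a hat on the predicted value), it would read:

```latex
\hat{y} = \beta_0 + \beta_1 x
```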
As you can see, we need to calculate the beta coefficients somehow. X represents input data, so that’s something you already have at your disposal.
The Beta 1 coefficient has to be calculated first. It represents the slope of the line and can be obtained with the following formula:
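The formula image is missing from this text. The standard least-squares estimate of the slope, which is consistent with the symbols described next, is:

```latex
\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
```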
The Xi represents the current value of the input feature, and X with a bar on top represents the mean of the entire variable. The same goes with Y, but we’re looking at the target variable instead.
Next, we have the Beta 0 coefficient. You can calculate it with the following formula:
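The formula image is missing here as well. The standard least-squares intercept, expressed with the means of the target and input variables and the slope, is:

```latex
\beta_0 = \bar{y} - \beta_1 \bar{x}
```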
And that’s all there is to simple linear regression! Once the coefficient values are calculated, you can plug in the number for X and get the prediction. As simple as that.
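As a quick arithmetic check of the slope and intercept calculations described above, here is a small sketch in plain Python. The five data points are invented for illustration, and the variable names are assumptions rather than anything from the original notebook.

```python
# Toy data, invented for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# Slope: sum of co-deviations divided by the sum of squared x deviations
numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
denominator = sum((x - x_mean) ** 2 for x in xs)
beta_1 = numerator / denominator

# Intercept: shifts the line so it passes through (x_mean, y_mean)
beta_0 = y_mean - beta_1 * x_mean

# Plug a new input into the line equation to get a prediction
x_new = 6.0
prediction = beta_0 + beta_1 * x_new
print(beta_0, beta_1, prediction)
```

For this toy data the slope works out to 0.6 and the intercept to roughly 2.2, so the prediction for an input of 6 is about 5.8.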
Let’s take a look at the Python implementation next.
From-Scratch Implementation
Let’s start with the library imports. You’ll only need NumPy and Matplotlib for now. The rcParams modifications are optional, only to make the visuals look a bit better:
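The import cell might look like the sketch below. The specific rcParams values are assumptions chosen for illustration, since the original settings are not shown in this text.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams

# Optional cosmetic tweaks; these particular values are assumptions
rcParams["figure.figsize"] = (14, 7)
rcParams["axes.spines.top"] = False
rcParams["axes.spines.right"] = False
```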
Now onto the algorithm implementation. Let’s declare a class called SimpleLinearRegression with the following methods:
- __init__() – the constructor, which holds the Beta 0 and Beta 1 coefficients. These have no value until the model is fitted.
- fit(X, y) – calculates the Beta 0 and Beta 1 coefficients from the input X and y parameters. After the calculation is done, the results are stored in the attributes declared in the constructor.
- predict(X) – makes the prediction using the line equation. It throws an error if the fit() method wasn’t called beforehand.
If you understand the math behind this simple algorithm, implementation in Python is easy. Here’s the entire code snippet for the class:
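The original snippet is not included in this text, so what follows is a minimal sketch of a class matching the description above. The attribute names b0 and b1, the use of None as the unfitted placeholder, the RuntimeError type, and returning self from fit() are all assumptions.

```python
import numpy as np


class SimpleLinearRegression:
    """Simple linear regression fitted with the closed-form formulas."""

    def __init__(self):
        # Coefficients have no value until fit() is called (None is an assumption)
        self.b0 = None  # intercept (Beta 0)
        self.b1 = None  # slope (Beta 1)

    def fit(self, X, y):
        """Estimate the intercept and slope from 1-D inputs."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        x_mean, y_mean = X.mean(), y.mean()
        numerator = np.sum((X - x_mean) * (y - y_mean))
        denominator = np.sum((X - x_mean) ** 2)
        self.b1 = numerator / denominator
        self.b0 = y_mean - self.b1 * x_mean
        return self

    def predict(self, X):
        """Apply the line equation; raises if the model is not fitted."""
        if self.b0 is None or self.b1 is None:
            raise RuntimeError("Call fit() before predict().")
        return self.b0 + self.b1 * np.asarray(X, dtype=float)
```

Returning self from fit() is a convenience that allows chaining, for example SimpleLinearRegression().fit(X, y).predict(X).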
Next, let’s create some dummy data. We’ll make a range of 300 data points as the input variable, and 300 normally distributed values as the target variable. The target variable is centered around the input variable, with a standard deviation of 20.
You can use the following code snippet to create and visualize the dataset:
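A sketch of that step is shown below, assuming the inputs are the integers 0 to 299. The random seed, plot styling, and labels are assumptions, so the exact values will differ from the original article.

```python
import numpy as np
import matplotlib.pyplot as plt

# Fixed seed so the sketch is reproducible; the seed value is an assumption
rng = np.random.default_rng(42)

# 300 evenly spaced inputs: 0, 1, ..., 299
X = np.arange(300)
# One normally distributed target per input, centered on it, with std 20
y = rng.normal(loc=X, scale=20)

plt.scatter(X, y, s=20, alpha=0.7)
plt.title("Source dataset")
plt.xlabel("X")
plt.ylabel("y")
plt.show()
```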
The visualization of the dataset is shown in the following figure:
Next, let’s split the dataset into training and testing subsets. You can use the train_test_split() function from Scikit-learn to do so:
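One way the split could look is sketched below. The dummy data is recreated so the snippet runs on its own, and the 80/20 split ratio and the random_state value are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Recreate the dummy data so the snippet runs on its own (seed is an assumption)
rng = np.random.default_rng(42)
X = np.arange(300)
y = rng.normal(loc=X, scale=20)

# Hold out 20% of the points for testing; both values below are assumptions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```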
Finally, let’s make an instance of the SimpleLinearRegression class, fit the training data, and make predictions on the test set. The following code snippet does just that, and also prints the values of Beta 0 and Beta 1 coefficients:
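The sketch below walks through these steps end to end. The class is repeated in condensed form so the snippet runs on its own. The attribute names b0 and b1, the variable name preds, the seed, and the split settings are assumptions, so the printed coefficients will not match the article's exact numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split


class SimpleLinearRegression:
    """Condensed copy of the from-scratch model, repeated so this runs alone."""

    def __init__(self):
        self.b0 = None  # intercept
        self.b1 = None  # slope

    def fit(self, X, y):
        x_mean, y_mean = np.mean(X), np.mean(y)
        self.b1 = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
        self.b0 = y_mean - self.b1 * x_mean

    def predict(self, X):
        if self.b0 is None or self.b1 is None:
            raise RuntimeError("Call fit() before predict().")
        return self.b0 + self.b1 * X


# Dummy data and split, recreated here; seed and split size are assumptions
rng = np.random.default_rng(42)
X = np.arange(300)
y = rng.normal(loc=X, scale=20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = SimpleLinearRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)

print(f"Beta 0: {model.b0:.4f}, Beta 1: {model.b1:.4f}")
```

Because the targets are centered on the inputs, the slope should land close to 1 and the intercept close to 0, up to the noise in the data.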
The coefficient values are displayed below:
And that’s your line equation formula. Next, we need a way to evaluate the model. Before doing that, let’s quickly see what the predictions and the y_test variable look like.
Here’s what’s inside the predictions:
And here’s what the actual test data looks like:
Not identical, sure, but quite similar overall. For a more quantitative evaluation metric, we’ll use RMSE (Root Mean Squared Error). Here’s how to calculate its value with Python:
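A minimal way to compute RMSE with NumPy is sketched below. The helper name rmse and the toy values are hypothetical, and the original article may have used a library function instead.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values, invented for illustration only
print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```

Squaring the residuals penalizes large misses more heavily, and taking the square root brings the result back to the units of the target variable.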
The average error is displayed below:
As you can see, our model is around 20 units wrong on average. That matches the noise introduced when creating the dataset (a standard deviation of 20), so there’s nothing we can do to improve the model further.
If you want to visualize the best fit line, you’d have to retrain the model on the entire dataset and plot the predictions. You can do so with the following code snippet:
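A self-contained sketch of that plot follows. To keep it short, the coefficients are computed inline with the same closed-form formulas instead of through the class. The seed, colors, and labels are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Dummy data recreated so the snippet runs on its own (seed is an assumption)
rng = np.random.default_rng(42)
X = np.arange(300)
y = rng.normal(loc=X, scale=20)

# Fit on the entire dataset using the closed-form slope and intercept
b1 = np.sum((X - X.mean()) * (y - y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = y.mean() - b1 * X.mean()
line = b0 + b1 * X

plt.scatter(X, y, s=20, alpha=0.7, label="Data points")
plt.plot(X, line, color="black", lw=2, label="Best fit line")
plt.title("Best fit line")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
```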
Here’s what it looks like:
And that’s all there is to a simple linear regression model. Let’s compare it to the LinearRegression class from Scikit-Learn and see if there are any severe differences.
Comparison with Scikit-Learn
We want to know if our model is any good, so let’s compare it with something we know works well – the LinearRegression class from Scikit-Learn.
You can use the following snippet to import the class, train the model, make predictions, and print the values for Beta 0 and Beta 1 coefficients:
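The comparison could be sketched as follows. The data and split are recreated so the snippet runs on its own, and the seed, split settings, and the names sk_model and sk_preds are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Dummy data and split, recreated here; seed and split size are assumptions
rng = np.random.default_rng(42)
X = np.arange(300)
y = rng.normal(loc=X, scale=20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scikit-Learn expects a 2-D feature matrix, so reshape to a single column
sk_model = LinearRegression()
sk_model.fit(X_train.reshape(-1, 1), y_train)
sk_preds = sk_model.predict(X_test.reshape(-1, 1))

print(f"Beta 0: {sk_model.intercept_:.4f}, Beta 1: {sk_model.coef_[0]:.4f}")
```

The fitted intercept is exposed as intercept_ and the slope as the first element of coef_.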
The coefficient values are displayed below:
As you can see, the coefficients are nearly identical! Next, let’s check the RMSE value:
Once again, nearly identical! Model quality – check.
Let’s wrap things up in the next section.
Today you’ve learned how to implement a simple linear regression algorithm in Python entirely from scratch.
Does that mean you should ditch the de facto standard machine learning libraries? No, not at all. Let me elaborate.
Just because you can write something from scratch doesn’t mean you should. Still, knowing every detail of how algorithms work is a valuable skill and can help you stand out from every other fit-and-predict data scientist.
Thanks for reading, and please stay tuned to the blog if you’re interested in more machine learning from scratch articles.