Linear regression is the simplest algorithm you’ll encounter while studying machine learning. Multiple linear regression is similar to the simple linear regression covered last week – the only difference being multiple slope parameters. How many? Well, that depends on how many input features there are – but more on that in a bit.
Today you’ll get your hands dirty implementing the multiple linear regression algorithm from scratch. This is the second of many upcoming from-scratch articles, so stay tuned to the blog if you want to learn more.
Today’s article is structured as follows:
- Introduction to Multiple Linear Regression
- Math Behind Multiple Linear Regression
- From-Scratch Implementation
- Comparison with Scikit-Learn
You can download the corresponding notebook here.
Introduction to Multiple Linear Regression
Multiple linear regression shares the same idea as its simple version – to find the best-fitting line (or hyperplane, once there is more than one feature) given the input data. What makes it different is the ability to handle multiple input features instead of just one.
The algorithm is rather strict about its requirements. Let’s list and explain a few:
- Linear Assumption – the model assumes the relationship between the input variables and the output variable is linear
- No Noise – the model assumes that the input and output variables are not noisy, so remove outliers if possible
- No Collinearity – highly correlated input variables make the coefficient estimates unstable, and the model tends to overfit
- Normal Distribution – the model will make more reliable predictions if your input and output variables are normally distributed. If that’s not the case, try applying transforms to your variables to make them more normal-looking
- Rescaled Inputs – use scalers or normalizers to make more reliable predictions
Training a multiple linear regression model means calculating the best coefficients for the line equation formula. The best coefficients can be calculated through an iterative optimization process, known as gradient descent.
This algorithm calculates the derivatives of the cost function with respect to each coefficient and updates the coefficients on each iteration. How much of an update there will be depends on one parameter – the learning rate. A high learning rate can lead to “missing” the best parameter values, and a low learning rate can lead to slow optimization.
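To make the effect of the learning rate concrete, here is a minimal sketch on a toy problem. The one-parameter cost function (w - 3)^2 and the three learning rates are assumptions chosen for illustration, not values from this article:

```python
def gradient_descent(learning_rate, n_iterations=50, start=0.0):
    """Minimize the toy cost (w - 3)^2, whose minimum is at w = 3."""
    w = start
    for _ in range(n_iterations):
        gradient = 2 * (w - 3)          # derivative of (w - 3)^2 with respect to w
        w -= learning_rate * gradient   # step against the gradient
    return w


# Hypothetical learning rates: too low, reasonable, too high
for lr in (0.001, 0.1, 1.1):
    print(f"learning rate {lr}: w after 50 iterations = {gradient_descent(lr)}")
```

With the lowest rate the parameter has barely moved after 50 iterations, with the middle one it lands very close to 3, and with the highest one each step overshoots the minimum by more than the previous one, so the value drifts further and further away.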
More on that in the next section, where we’ll discuss the math behind the algorithm.
Math Behind Multiple Linear Regression
The math behind multiple linear regression is a bit more involved than it was for the simple one, as we won’t simply plug the values into a formula. We’re dealing with an iterative process instead.
The equation we’re solving remains more or less the same:
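Written out for n input features, the line equation would look along these lines. The symbols n and x_1 … x_n are notation assumed here; w and b are the weights and bias described next:

```latex
\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \mathbf{w} \cdot \mathbf{x} + b
```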
We don’t have a single beta coefficient for the slope, but instead, we have an entire vector of them, one per input feature – denoted as w for weights. There’s still a single intercept value – denoted as b for bias.
We’ll have to declare a cost function to continue. This is a function that measures error and represents something we want to minimize. Mean squared error (MSE) is the most common cost function for linear regression:
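In standard notation, with N as the number of training samples (a symbol assumed here), MSE can be written as:

```latex
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2
```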
Put simply, it represents the average squared difference between actual (yi) and predicted (y hat) values. The y hat can be expanded into the following:
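Substituting the line equation for the prediction, a sketch of the expanded form is as follows, where x_i is assumed to denote the feature values of the i-th sample and N the number of samples:

```latex
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (w x_i + b) \right)^2
```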
As mentioned before, we’ll use the gradient descent algorithm to find optimal weights and bias. It relies on calculating the partial derivative of the cost function with respect to each parameter. The partial derivatives of MSE with respect to the weights and the bias are as follows:
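Applying the chain rule to the MSE definition gives the two partial derivatives. This is a standard derivation written in assumed notation (N samples, x_i the features of the i-th sample), not a copy of the original formulas:

```latex
\frac{\partial \mathrm{MSE}}{\partial w} = \frac{1}{N} \sum_{i=1}^{N} -2 x_i \left( y_i - (w x_i + b) \right)
\qquad
\frac{\partial \mathrm{MSE}}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} -2 \left( y_i - (w x_i + b) \right)
```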
Finally, the update process can be summarized into two formulas – one for each parameter. Put simply, the product of the learning rate and the derivative is subtracted from the old weight (or bias) value:
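In the usual gradient descent notation, the two update rules can be sketched as:

```latex
w \leftarrow w - \alpha \, \frac{\partial \mathrm{MSE}}{\partial w}
\qquad
b \leftarrow b - \alpha \, \frac{\partial \mathrm{MSE}}{\partial b}
```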
The alpha parameter represents the learning rate.
The entire process is repeated for the desired number of iterations. Let’s see how this works in practice, by implementing a from-scratch solution with Python.
From-Scratch Implementation
Let’s start with the library imports. You’ll only need NumPy and Matplotlib for now. The rcParams modifications are optional, only to make the visuals look a bit better:
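A minimal sketch of such an import cell is shown here. The specific rcParams values (figure size, hidden top and right spines) are assumptions picked for illustration, since the original settings are not known:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams

# Optional cosmetic tweaks (assumed values): larger figures, no top/right border
rcParams['figure.figsize'] = (14, 7)
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False
```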
Onto the algorithm now. Let’s declare a class called LinearRegression with the following methods:
- __init__() – the constructor, contains the values for learning rate and the number of iterations, alongside the weights and bias (initially set to None). We’ll also create an empty list to track loss at each iteration.
- _mean_squared_error(y, y_hat) – “private” method, used as our cost function.
- fit(X, y) – iteratively optimizes weights and bias through gradient descent. After the calculation is done, the results are stored in the instance attributes declared in the constructor. We’re also keeping track of loss here.
- predict(X) – makes the prediction using the line equation.
If you understand the math behind this simple algorithm, implementation in Python is easy. Here’s the entire code snippet for the class:
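What follows is a minimal sketch of a class matching the method descriptions above, not necessarily the original code. The default learning rate of 0.01, the zero initialization of the weights and bias, and the attribute names are assumptions:

```python
import numpy as np


class LinearRegression:
    """Multiple linear regression trained with batch gradient descent."""

    def __init__(self, learning_rate=0.01, n_iterations=10000):
        # Defaults are assumptions; the article trains for 10000 iterations
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights, self.bias = None, None
        self.loss = []

    @staticmethod
    def _mean_squared_error(y, y_hat):
        # Average squared difference between actual and predicted values
        return float(np.mean((y - y_hat) ** 2))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Assumed starting point: all parameters at zero
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        self.loss = []

        for _ in range(self.n_iterations):
            # Line equation with the current parameters
            y_hat = X @ self.weights + self.bias
            self.loss.append(self._mean_squared_error(y, y_hat))

            # Partial derivatives of MSE with respect to weights and bias
            residual = y - y_hat
            d_weights = -(2 / n_samples) * (X.T @ residual)
            d_bias = -(2 / n_samples) * np.sum(residual)

            # Subtract learning rate times derivative from the old values
            self.weights -= self.learning_rate * d_weights
            self.bias -= self.learning_rate * d_bias
        return self

    def predict(self, X):
        return X @ self.weights + self.bias
```

A quick sanity check is to fit the class on synthetic data generated from known coefficients and confirm that the learned weights and bias land close to them.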
Let’s test the algorithm next. We’ll use the Diabetes dataset from Scikit-Learn. The following code snippet loads the dataset and splits it into features and target arrays:
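One way to do this is with Scikit-Learn's load_diabetes function, as sketched here. The variable names are assumptions:

```python
from sklearn.datasets import load_diabetes

# Load the bundled Diabetes dataset and separate features from the target
data = load_diabetes()
X = data.data      # feature array, one row per patient
y = data.target    # target array, one value per patient

print(X.shape, y.shape)
```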
The next step is to split the dataset into training and testing subsets and to train the model. You can do so with the following code snippet:
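The split itself can be sketched with Scikit-Learn's train_test_split. The 80:20 ratio and the random seed are assumptions, as the original values are not known:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Assumed settings: 20% of the rows held out for testing, fixed seed for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

With the from-scratch LinearRegression class in scope, training then comes down to creating an instance and calling its fit method with X_train and y_train, and predictions for the test set come from calling predict with X_test.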
Here’s what the “optimal” weights look like:
And here’s the optimal value for our bias term:
And that’s all there is to it! You’ve successfully trained the model for 10000 iterations and landed on a hopefully good set of parameters. Let’s see how good they are by plotting the loss:
Ideally, we should see a line that starts at a high loss value and quickly drops to somewhere near zero:
It looks promising, but how can we know if the loss is low enough to produce a good quality model? Well, we can’t, at least not directly. The best thing we can do loss-wise is to train a couple of models with different learning rates and compare loss curves. The following code snippet does just that:
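The comparison can be sketched as follows. To keep the example self-contained, it uses a compact stand-in function (the hypothetical gradient_descent_loss) instead of the class, and the four learning rates, the split settings, and the zero initialization are all assumptions:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split


def gradient_descent_loss(X, y, learning_rate, n_iterations=10000):
    """Run batch gradient descent and return the MSE recorded at every iteration."""
    n_samples, n_features = X.shape
    weights, bias = np.zeros(n_features), 0.0
    history = []
    for _ in range(n_iterations):
        residual = y - (X @ weights + bias)
        history.append(float(np.mean(residual ** 2)))
        # The MSE derivatives carry a minus sign, so subtracting them means adding here
        weights += learning_rate * (2 / n_samples) * (X.T @ residual)
        bias += learning_rate * (2 / n_samples) * residual.sum()
    return history


X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumed candidate learning rates
losses = {
    lr: gradient_descent_loss(X_train, y_train, lr)
    for lr in (0.5, 0.1, 0.01, 0.001)
}
for lr, history in losses.items():
    print(f"learning rate {lr}: final training MSE = {history[-1]:.2f}")
```

Each entry in losses is a full loss history, so passing it to Matplotlib's plot function draws one curve per learning rate for a visual comparison.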
The results are shown below:
It seems like a learning rate of 0.5 works the best for our data. You can retrain the model with that learning rate using the following code snippet:
Here’s the corresponding MSE value on the test set:
And that’s how easy it is to build, train, evaluate, and tweak a multiple linear regression model from scratch! Let’s compare it to the LinearRegression class from Scikit-Learn and see if there are any severe differences.
Comparison with Scikit-Learn
We want to know if our model is any good, so let’s compare it with something we know works well – the LinearRegression class from Scikit-Learn.
You can use the following snippet to import the model class, train the model, make predictions, and print the value of the mean squared error on the test set:
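A self-contained sketch of that comparison run is shown here. The split settings (20% test size, seed 42) are assumptions and need to match whatever split the from-scratch model was trained on for the comparison to be fair:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scikit-Learn solves the least-squares problem directly, no learning rate needed
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_preds = lr_model.predict(X_test)

mse = mean_squared_error(y_test, lr_preds)
print(mse)
```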
Here is the corresponding MSE value:
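As you can see, our tweaked model came out slightly ahead of the default one from Scikit-Learn on the test set, but the difference isn’t significant. Keep in mind that Scikit-Learn’s LinearRegression solves the least-squares problem exactly, so it already reaches the lowest possible MSE on the training set; a small edge on the test set reflects this particular train/test split rather than a better fit. Model quality – check.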
Let’s wrap things up in the next section.
Conclusion
Today you’ve learned how to implement the multiple linear regression algorithm in Python entirely from scratch. Does that mean you should ditch the de facto standard machine learning libraries? No, not at all. Let me elaborate.
Just because you can write something from scratch doesn’t mean you should. Still, knowing every detail of how algorithms work is a valuable skill and can help you stand out from every other fit-and-predict data scientist.
Thanks for reading, and please stay tuned to the blog if you’re interested in more from-scratch machine learning articles.