Machine learning models can be quite accurate out of the box. But more often than not, the accuracy can improve with hyperparameter tuning.
Hyperparameter tuning is a lengthy process of increasing the model accuracy by tweaking the hyperparameters – values that can’t be learned and need to be specified before the training.
Today you’ll learn three ways of approaching hyperparameter tuning. You’ll go from the most manual approach towards a GridSearchCV class implemented with the Scikit-Learn library.
The article is structured as follows:
- Dataset loading and preparation
- Manual hyperparameter tuning
- Loop-based hyperparameter tuning
- Hyperparameter tuning with GridSearch
You can download the Notebook for this article here.
Dataset loading and preparation
There’s no need to go crazy here. A simple dataset will do. You’ll work with the Iris dataset loaded straight from the web.
Library-wise, you’ll need Pandas to work with data, and a couple of classes/functions from Scikit-Learn. Here’s how to load in the libraries and the dataset:
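The original snippet isn’t reproduced here, so what follows is only a minimal sketch. It assumes the copy of Iris bundled with Scikit-Learn (load_iris) instead of a download from the web, a substitution made to keep the example self-contained. The variable names and the species column are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: use Scikit-Learn's bundled Iris copy instead of a web download
iris = load_iris(as_frame=True)
df = iris.frame.copy()  # four feature columns plus an integer 'target' column

# Replace the integer class codes with readable species names
code_to_name = dict(enumerate(iris.target_names))
df["species"] = df["target"].map(code_to_name)
df = df.drop(columns="target")

print(df.head())
```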
Calling the head() function shows the following data frame subset:
The dataset is as clean as they come, so there’s no need for additional preparation. Next, you’ll split it into training and testing subsets. Here’s how:
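A split along these lines could look as follows. This is a sketch under assumptions: the 25% test share and the random_state value are not given in the text and were picked only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Features and target as pandas objects (bundled Iris copy, an assumption)
X, y = load_iris(return_X_y=True, as_frame=True)

# Assumed settings: hold out 25% for testing, fix the seed for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)
```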
Finally – let’s build a default model. It’ll show you how accurate the model with the default hyperparameters is, and it will serve as a baseline which the tweaked models should outperform.
Here’s how to train a Decision Tree model on the training set and obtain the accuracy score and confusion matrix:
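A baseline of this kind can be sketched as below. The split settings and random_state are assumptions, so the numbers it prints will not necessarily match the ones quoted in this article.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed split settings
)

# Default hyperparameters; random_state only makes the run repeatable
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_test)

baseline_accuracy = accuracy_score(y_test, preds)
baseline_cm = confusion_matrix(y_test, preds)

print(baseline_accuracy)
print(baseline_cm)
```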
The corresponding accuracy and confusion matrix are shown below:
In a nutshell – you want a model with more than 97% accuracy on the test set. Let’s see if hyperparameter tuning can do that.
Manual hyperparameter tuning
You don’t need a dedicated library for hyperparameter tuning. But it’ll be a tedious process.
Before starting, you’ll need to know which hyperparameters you can tune. You can find the entire list in the library documentation page for decision trees. You’ll optimize only three of them in this article. These are:
- criterion: the function which measures the quality of a split, can be either gini (default) or entropy
- splitter: the strategy for choosing a split at each node, can be either best (default) or random
- max_depth: the maximum depth of a tree, an integer value
You can define a set of hyperparameter values as a dictionary (key-value pairs) and then build separate models from them. Here’s how:
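The manual approach might look roughly like this. The three hyperparameter sets are hypothetical values chosen for illustration, and the split settings are assumptions as well.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed split settings
)

# Hypothetical hyperparameter sets, one dictionary per model
params_1 = {"criterion": "gini", "splitter": "best", "max_depth": 2}
params_2 = {"criterion": "entropy", "splitter": "random", "max_depth": 4}
params_3 = {"criterion": "entropy", "splitter": "best", "max_depth": 8}

# One model per dictionary, written out by hand on purpose
model_1 = DecisionTreeClassifier(**params_1, random_state=42).fit(X_train, y_train)
model_2 = DecisionTreeClassifier(**params_2, random_state=42).fit(X_train, y_train)
model_3 = DecisionTreeClassifier(**params_3, random_state=42).fit(X_train, y_train)

acc_1 = accuracy_score(y_test, model_1.predict(X_test))
acc_2 = accuracy_score(y_test, model_2.predict(X_test))
acc_3 = accuracy_score(y_test, model_3.predict(X_test))

print(acc_1, acc_2, acc_3)
```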
Here are the corresponding accuracies:
To conclude – you’ve already managed to outperform the baseline model, but this approach isn’t scalable. Imagine you wanted to test 1000 combinations, which is actually a small number. Writing the code this way isn’t the way to go. Let’s improve it next.
Loop-based hyperparameter tuning
You can improve the previous solution by specifying possible hyperparameter values inside a list. There’ll be as many lists as there are hyperparameters. The model is then trained and evaluated inside a nested loop.
Here’s an example code snippet:
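Since the original snippet isn’t shown, here is a sketch of the nested-loop idea. The candidate values in the three lists are assumptions, as are the split settings and the column name accuracy.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed split settings
)

# One list of candidate values per hyperparameter (illustrative values)
criterions = ["gini", "entropy"]
splitters = ["best", "random"]
max_depths = [2, 4, 6, 8]

results = []
for criterion in criterions:
    for splitter in splitters:
        for max_depth in max_depths:
            model = DecisionTreeClassifier(
                criterion=criterion,
                splitter=splitter,
                max_depth=max_depth,
                random_state=42,
            )
            model.fit(X_train, y_train)
            accuracy = accuracy_score(y_test, model.predict(X_test))
            # Store the score together with the values that produced it
            results.append({
                "accuracy": accuracy,
                "criterion": criterion,
                "splitter": splitter,
                "max_depth": max_depth,
            })

# Convert to a data frame and put the best combinations first
results_df = pd.DataFrame(results).sort_values("accuracy", ascending=False)
print(results_df.head())
```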
As you can see, model accuracy on the test set and the respective hyperparameter values were stored as a dictionary in a list, which was later converted into a data frame. It’s easy to sort the data frame and see which hyperparameter combination did the best:
To conclude – this approach works great, but you’re doomed to use nested loops. It’s okay for three hyperparameters, but imagine optimizing for ten. There must be a better way.
Hyperparameter tuning with GridSearch
The GridSearchCV class comes with Scikit-Learn, and it makes hyperparameter tuning a joy. The search can still take a long time (that depends on the number of combinations, not on the class), but you’re free from writing things manually.
You’ll need to declare a hyperparameter space as a dictionary, where each key is the name of the hyperparameter, and its value is a list of possible values. You can then use the GridSearchCV class to find an optimal set by fitting it to the training data.
There’s also the benefit of built-in cross-validation with this approach, which reduces the influence of a lucky or unlucky split on the results.
Here’s the entire code snippet:
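A grid search along these lines can be sketched as follows. The candidate values, the 5-fold setting, and the accuracy scoring are assumptions made for the example, not settings taken from the article.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed split settings
)

# Hyperparameter space: name -> list of candidate values (illustrative)
param_grid = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": [2, 4, 6, 8],
}

grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,                # assumed number of cross-validation folds
    scoring="accuracy",
)
# Cross-validation happens inside fit, on the training data only
grid.fit(X_train, y_train)

print(grid.best_score_)
```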
You can then store the results in a Pandas data frame (for easier inspection) – here’s how:
And here’s what part of this data frame looks like:
Let’s filter this data frame to keep only the columns of interest (the average test score and the hyperparameter values used) and sort it by the average test score:
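Storing, filtering, and sorting the results can be sketched in one go. The grid and fold count are assumptions. The column names params and mean_test_score are the ones GridSearchCV exposes in cv_results_.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed split settings
)

param_grid = {  # illustrative candidate values
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": [2, 4, 6, 8],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

# cv_results_ is a dictionary of arrays, so it converts cleanly
cv_df = pd.DataFrame(grid.cv_results_)

# Keep the hyperparameter values and the average test score, best first
summary_df = (
    cv_df[["params", "mean_test_score"]]
    .sort_values("mean_test_score", ascending=False)
    .reset_index(drop=True)
)
print(summary_df.head())
```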
Here are the results:
That’s a good approach if you’re interested in examining multiple combinations. An easier way exists if you only want the best values:
This property returns a dictionary:
You can pass the dictionary directly to the machine learning model by using dictionary unpacking.
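As a sketch of that last step: after fitting, GridSearchCV exposes the winning combination through its best_params_ attribute, and the dictionary can be unpacked with ** into a new model. The grid, folds, and split settings below are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed split settings
)

param_grid = {  # illustrative candidate values
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": [2, 4, 6, 8],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

best_params = grid.best_params_  # plain dictionary: name -> best value
print(best_params)

# Unpack the dictionary straight into the model constructor
final_model = DecisionTreeClassifier(**best_params, random_state=42)
final_model.fit(X_train, y_train)
final_accuracy = accuracy_score(y_test, final_model.predict(X_test))
print(final_accuracy)
```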
And that’s how easy it is to find optimal hyperparameters for a machine learning algorithm. Let’s wrap things up next.
The last approach will get the job done most of the time. You’re free to do the optimization manually, but what’s the point?
Grid search can take a lot of time to finish. Let’s say you have 5 parameters with 5 possible values each. That’s 5^5 = 3125 possible combinations. Add cross-validation into the picture (let’s say 10-fold), and that is 31250 model fits you need to train and evaluate.
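The arithmetic can be checked with Scikit-Learn’s ParameterGrid helper, which enumerates every combination of a grid. The parameter names below are placeholders, not real hyperparameters.

```python
from sklearn.model_selection import ParameterGrid

# Placeholder grid: 5 hypothetical parameters with 5 candidate values each
param_grid = {f"param_{i}": list(range(5)) for i in range(1, 6)}

n_combinations = len(ParameterGrid(param_grid))  # 5 ** 5
n_folds = 10
n_fits = n_combinations * n_folds  # one fit per combination per fold

print(n_combinations, n_fits)  # prints: 3125 31250
```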
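For these cases, a randomized search (the RandomizedSearchCV class in Scikit-Learn) might be a better option, since it evaluates only a sample of the combinations. Code-wise it works almost the same as the non-randomized one, so that’s why it wasn’t covered today.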
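For reference, a randomized search could be sketched like this. It isn’t part of the article’s walkthrough. The candidate values, n_iter, and fold count are assumptions. The main differences from the grid version are the param_distributions argument and n_iter, which caps how many combinations get sampled.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42  # assumed split settings
)

param_distributions = {  # illustrative candidate values, 16 combinations
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": [2, 4, 6, 8],
}

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # sample only 5 of the 16 combinations (assumed)
    cv=5,              # assumed number of folds
    random_state=42,   # makes the sampled combinations repeatable
)
search.fit(X_train, y_train)

print(search.best_params_)
```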
Thanks for reading.