Why I stopped using Grid Search CV
I’ve always used scikit-learn’s GridSearchCV or RandomizedSearchCV to tune the hyperparameters of a classifier. Even when I’m coding neural networks in PyTorch, I’ll loop over a list (e.g. learning_rate = [0.001, 0.01, 0.1, 0.5]) to determine the best value. However, these methods require a lot of manual effort, and the parameter values have to be chosen up front: how do I know whether the values in the search space are any good in the first place? Is it possible to automatically tune the hyperparameters over all possible values within a range, while I go for a jog?
Well, I might have found an answer: Optuna!
In my previous post, I experimented with different types of ML algorithms, and the two best-performing ones (excluding ensembles) were Logistic Regression and AdaBoost. Hence, in this tutorial, I shall demonstrate how to use Optuna to select the best model between these two candidates. The metric to be maximised is the recall score.
First, let’s read in the data.
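The original dataset isn’t reproduced here, so this is only a minimal sketch of that step; the file name, target column, and split settings are placeholders rather than the exact values from the previous post.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset (file name and target column are placeholders)
df = pd.read_csv("data.csv")
X = df.drop(columns=["target"])
y = df["target"]

# Hold out a test set; Optuna will tune on the training split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```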
Then, we define the objective, which is to maximise the recall score.
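The study boils down to an objective function that samples a classifier and its hyperparameters for each trial and returns the cross-validated recall. Here is a sketch along those lines; the search ranges, the cv=5 setting, and the Logistic Regression branch are my assumptions, and only the parameter names are taken from the output below.

```python
import optuna
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Each trial picks one of the two candidate models
    classifier = trial.suggest_categorical("classifier", ["LogisticRegression", "AdaBoost"])

    if classifier == "LogisticRegression":
        # Regularisation strength sampled on a log scale (range is an assumption)
        c = trial.suggest_float("logreg_c", 1e-3, 1e2, log=True)
        model = LogisticRegression(C=c, max_iter=1000, random_state=42)
    else:
        # AdaBoost search space (ranges are assumptions)
        n_estimators = trial.suggest_int("n_estimators", 10, 200)
        learning_rate = trial.suggest_float("learning_rate", 0.01, 1.0)
        model = AdaBoostClassifier(
            n_estimators=n_estimators, learning_rate=learning_rate, random_state=42
        )

    # Maximise the cross-validated recall on the training split
    return cross_val_score(model, X_train, y_train, scoring="recall", cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=200)

print("Number of finished trials:", len(study.trials))
print("Best trial:")
print("Value:", study.best_trial.value)
print("Params:", study.best_trial.params)
```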
Output:
Number of finished trials: 200
Best trial:
Value: 0.8137882018479033
Params:
classifier: AdaBoost
n_estimators: 10
learning_rate: 0.555344080589061
CPU times: user 51.5 s, sys: 10.3 s, total: 1min 1s
Wall time: 48.9 s
From the study above, it is apparent that AdaBoost is the superior classifier, increasing the recall score from 0.79 (in the previous post) to 0.81! 🥳 You can find the full code in this repo.
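To put the winning configuration to work, you can pull the parameters out of study.best_trial and refit on the training data. A rough sketch, reusing the names from the earlier snippets:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import recall_score

# Rebuild the best model found by the study
best_params = study.best_trial.params  # e.g. {'classifier': 'AdaBoost', 'n_estimators': 10, ...}
final_model = AdaBoostClassifier(
    n_estimators=best_params["n_estimators"],
    learning_rate=best_params["learning_rate"],
    random_state=42,  # same seed as during tuning (see the tips below)
)
final_model.fit(X_train, y_train)
print("Test recall:", recall_score(y_test, final_model.predict(X_test)))
```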
Tips
- Use the same random_state during tuning and model inference to stay consistent.
- For massive datasets, I suggest taking a subset and tuning the model on it with Optuna.
- If the dataset is highly imbalanced, undersample the data so that all classes are still represented, then tune the model (see the sketch below).
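As a rough illustration of the last tip, here is one way to do the undersampling using the imbalanced-learn package (my choice here, not something from the original post); the objective function would then use X_tune and y_tune in place of X_train and y_train.

```python
from imblearn.under_sampling import RandomUnderSampler

# Undersample the majority class so every class appears equally often,
# then run the Optuna study on the smaller, balanced subset
rus = RandomUnderSampler(random_state=42)
X_tune, y_tune = rus.fit_resample(X_train, y_train)
```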