Why I stopped using GridSearchCV
I’ve always used scikit-learn’s GridSearchCV
or RandomizedSearchCV
to tune the hyperparameters of a classifier. Even when I’m coding neural networks in PyTorch, I’ll loop over a list (e.g. learning_rate = [0.001, 0.01, 0.1, 0.5]
) to determine the best value. However, these methods require a lot of manual effort, and the parameter values have to be pre-determined: how do I know whether the values in the search space are any good in the first place? Is it possible to automatically tune the hyperparameters over an entire range of values, while I go for a jog?
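For context, here is what that manual workflow looks like: a minimal GridSearchCV sketch on toy data (the dataset and grid values here are illustrative, not from my project).

```python
# Every candidate value must be listed up front, and the grid is
# searched exhaustively -- the two limitations discussed above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)  # toy data
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10]}  # pre-determined values only
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=3, scoring="recall")
search.fit(X, y)
print(search.best_params_)
```

If the best C actually lies between two of the listed values, grid search will never find it.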
Well, I might have found an answer — Optuna!
In my previous post, I experimented with different types of ML algorithms, and the top two performers (excluding ensembles) were Logistic Regression and AdaBoost. Hence, in this tutorial, I shall demonstrate how to use Optuna to select the better model of these two candidates. The metric to be maximised is the recall score.
First, let’s read the data:
import pandas as pd

df = pd.read_csv('preprocessed.csv')  # insert dataset here
y = df.pop('Churn')
X = df
Then, we define the objective, which is to maximise the recall score.
import optuna
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Select the better model out of these two candidates
    classifier_name = trial.suggest_categorical("classifier", ["Logistic", "AdaBoost"])
    if classifier_name == "Logistic":
        # Add parameters here
        penalty = trial.suggest_categorical('penalty', ['l2', 'l1'])
        # 'lbfgs' does not support the l1 penalty, so switch to 'saga'
        if penalty == 'l1':
            solver = 'saga'
        else:
            solver = 'lbfgs'
        regularization = trial.suggest_float('logistic-regularization', 0.01, 10)
        model = LogisticRegression(penalty=penalty,
                                   C=regularization,
                                   solver=solver,
                                   random_state=0)
    else:
        # Add parameters here
        ada_n_estimators = trial.suggest_int("n_estimators", 10, 500, step=10)
        ada_learning_rate = trial.suggest_float("learning_rate", 0.1, 3)
        model = AdaBoostClassifier(n_estimators=ada_n_estimators,
                                   learning_rate=ada_learning_rate,
                                   random_state=0)
    # Score with recall, since that is the metric we want to maximise
    score = cross_val_score(model, X, y, scoring='recall', n_jobs=-1, cv=3)
    recall = score.mean()
    return recall
# Create a study object and optimise the objective function
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_trial)
print(study.best_trial.params) # Find out the parameter values
Output:
Number of finished trials: 200
Best trial:
Value: 0.8137882018479033
Params:
classifier: AdaBoost
n_estimators: 10
learning_rate: 0.555344080589061
CPU times: user 51.5 s, sys: 10.3 s, total: 1min 1s
Wall time: 48.9 s
From the study above, it is apparent that AdaBoost is the superior classifier, increasing the recall score from 0.79 (from the previous post) to 0.81! 🥳 You can find the full code in this repo.
Tips
- Use the same random_state during tuning and model inference to stay consistent.
- For massive datasets, I suggest taking a subset and tuning the model with Optuna on that.
- If the dataset is highly imbalanced, undersample the data while ensuring that all classes are represented, then tune the model.
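A simple way to act on the last two tips is a stratified subsample, which shrinks the data while preserving the class ratio. A minimal sketch on toy imbalanced data (the dataset here is synthetic, standing in for your own):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.9, 0.1],
                           random_state=0)  # toy imbalanced data

# stratify=y keeps the class ratio intact in the subset, so the
# minority class is not lost when downsizing for tuning
X_tune, _, y_tune, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=0)
print(X_tune.shape)  # (2000, 20)
```

Tune on (X_tune, y_tune), then refit the best configuration on the full dataset.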