Why I stopped using Grid Search CV

I’ve always used scikit-learn’s GridSearchCV or RandomizedSearchCV to tune the hyperparameters of a classifier. Even when I’m coding neural networks in PyTorch, I’ll loop over a list (e.g. learning_rate = [0.001, 0.01, 0.1, 0.5]) to determine the best value. However, these methods require a lot of manual effort, and the parameter values have to be pre-determined: how do I know whether the values in the search space are any good in the first place? Is it possible to automatically tune the hyperparameters over a whole range of values while I go for a jog?
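
For context, this is roughly the kind of manual search I mean. It is a minimal sketch on toy data; the classifier and parameter grid are illustrative, not taken from this project:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy data, purely for illustration
X_demo, y_demo = make_classification(n_samples=500, random_state=0)

# Every candidate value has to be typed out by hand, up front
param_grid = {'C': [0.01, 0.1, 1, 10], 'penalty': ['l2']}

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=3, scoring='recall')
grid.fit(X_demo, y_demo)
print(grid.best_params_, grid.best_score_)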

Well, I might have found an answer: Optuna!

In my previous post, I experimented with different types of ML algorithms, and the two best-performing ones (excluding the ensemble) were Logistic Regression and AdaBoost. Hence, in this tutorial, I shall demonstrate how to use Optuna to select the best model out of these two candidates. The metric to be maximised is the recall score.

First, let’s read in the data:

import optuna
import pandas as pd

df = pd.read_csv('preprocessed.csv') # insert dataset here

# Separate the target (Churn) from the features
y = df.pop('Churn')
X = df

Then, we define the objective, which is to maximise the recall score.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def objective(trial):

    # Selecting the best model out of these two candidates
    classifier_name = trial.suggest_categorical("classifier", ["Logistic", "AdaBoost"])
    if classifier_name == "Logistic":
        
        # Add parameters here
        penalty = trial.suggest_categorical('penalty', ['l2', 'l1'])
        if penalty == 'l1':
            solver = 'saga'   # 'lbfgs' does not support the l1 penalty
        else:
            solver = 'lbfgs'
        regularization = trial.suggest_float('logistic-regularization', 0.01, 10)
        model = LogisticRegression(penalty=penalty, 
                                   C=regularization, 
                                   solver=solver, 
                                   random_state=0)
    else:
        
        # Add parameters here
        ada_n_estimators = trial.suggest_int("n_estimators", 10, 500, step=10)
        ada_learning_rate = trial.suggest_float("learning_rate", 0.1, 3)
        
        model = AdaBoostClassifier(
            n_estimators=ada_n_estimators,
            learning_rate=ada_learning_rate,
            random_state=0
        )

    # Cross-validated recall is the value Optuna will maximise
    score = cross_val_score(model, X, y, n_jobs=-1, cv=3, scoring='recall')
    recall = score.mean()
    return recall


# Create a study object and optimise the objective function
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=200)

print("Number of finished trials:", len(study.trials))
print("Best trial:")
print("  Value:", study.best_trial.value)
print("  Params:")  # Find out the parameter values
for key, value in study.best_trial.params.items():
    print(f"    {key}: {value}")

Output:

Number of finished trials: 200
Best trial:
  Value: 0.8137882018479033
  Params: 
    classifier: AdaBoost
    n_estimators: 10
    learning_rate: 0.555344080589061
CPU times: user 51.5 s, sys: 10.3 s, total: 1min 1s
Wall time: 48.9 s

From the study above, it is apparent that AdaBoost is the superior classifier, increasing the recall score from 0.79 (from the previous post) to 0.81! 🥳 You can find the full code in this repo.
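
To actually use the result, you can refit the winning model on the training data with the tuned values. A minimal sketch, assuming AdaBoost won as in the output above:

best_params = study.best_trial.params

# Rebuild the winning model with the tuned hyperparameters
# (the 'classifier' key is only used for model selection, so it is ignored here)
model = AdaBoostClassifier(
    n_estimators=best_params['n_estimators'],
    learning_rate=best_params['learning_rate'],
    random_state=0   # same random_state as during tuning
)
model.fit(X, y)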

Tips

  • Use the same random_state during tuning and at model inference time to stay consistent.
  • For massive datasets, I suggest tuning the model with Optuna on a subset of the data.
  • If the dataset is highly imbalanced, undersample the data while keeping all classes represented, then tune the model (see the sketch below).
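
As a rough sketch of the last two tips: one way to draw a smaller, class-balanced sample before handing it to Optuna. The 20% fraction and the use of imbalanced-learn's RandomUnderSampler are my own choices here, not part of the tutorial above:

from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

# Take a stratified 20% subset for tuning (the fraction is an assumption)
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=0
)

# Undersample the majority class so every class is equally represented
X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X_small, y_small)

# Tune on the reduced data by swapping X, y for X_bal, y_bal inside objective()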
