Why I stopped using GridSearchCV
I’ve always used scikit-learn’s GridSearchCV
or RandomizedSearchCV
to tune the hyperparameters of a classifier. Even when I’m coding neural networks in PyTorch, I’ll loop over a list (e.g. learning_rate = [0.001, 0.01, 0.1, 0.5]
) to determine the best value. However, these methods require a lot of manual effort, and the parameter values have to be pre-determined: how do I know whether the values in the search space are any good in the first place? Is it possible to automatically tune the hyperparameters over an entire range of values, while I go for a jog?
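For context, here is what that manual workflow looks like: a minimal GridSearchCV sketch on toy data (the dataset and grid values here are illustrative, not from my project).

```python
# Every candidate value must be listed up front, and the grid is
# searched exhaustively -- the two limitations discussed above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)  # toy data
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10]}  # pre-determined values only
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=3, scoring="recall")
search.fit(X, y)
print(search.best_params_)
```

If the best C actually lies between two of the listed values, grid search will never find it.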
Well, I might have found an answer — Optuna!
In my previous post, I experimented with different types of ML algorithms, and the top two performers (excluding ensembles) were Logistic Regression and AdaBoost. Hence, in this tutorial, I shall demonstrate how to use Optuna to select the better model of these two candidates. The metric to be maximised is the recall score.
First, let’s read the data:
import pandas as pd

df = pd.read_csv('preprocessed.csv')  # insert dataset here
y = df.pop('Churn')
X = df
Then, we define the objective, which is to maximise the recall score.
import optuna
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Select the better model out of these two candidates
    classifier_name = trial.suggest_categorical("classifier", ["Logistic", "AdaBoost"])
    if classifier_name == "Logistic":
        # Add parameters here
        penalty = trial.suggest_categorical('penalty', ['l2', 'l1'])
        # 'lbfgs' does not support the l1 penalty, so switch to 'saga'
        if penalty == 'l1':
            solver = 'saga'
        else:
            solver = 'lbfgs'
        regularization = trial.suggest_float('logistic-regularization', 0.01, 10)
        model = LogisticRegression(penalty=penalty,
                                   C=regularization,
                                   solver=solver,
                                   random_state=0)
    else:
        # Add parameters here
        ada_n_estimators = trial.suggest_int("n_estimators", 10, 500, step=10)
        ada_learning_rate = trial.suggest_float("learning_rate", 0.1, 3)
        model = AdaBoostClassifier(n_estimators=ada_n_estimators,
                                   learning_rate=ada_learning_rate,
                                   random_state=0)
    # Score with recall, since that is the metric we want to maximise
    score = cross_val_score(model, X, y, scoring='recall', n_jobs=-1, cv=3)
    recall = score.mean()
    return recall
# Create a study object and optimise the objective function
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_trial)
print(study.best_trial.params) # Find out the parameter values
Output:
Number of finished trials: 200
Best trial:
Value: 0.8137882018479033
Params:
classifier: AdaBoost
n_estimators: 10
learning_rate: 0.555344080589061
CPU times: user 51.5 s, sys: 10.3 s, total: 1min 1s
Wall time: 48.9 s
From the study above, it is apparent that AdaBoost is the superior classifier, increasing the recall score from 0.79 (from the previous post) to 0.81! 🥳 You can find the full code in this repo.
Tips
- Use the same random_state during tuning and model inference to stay consistent.
- For massive datasets, I suggest taking a subset and tuning the model with Optuna on that.
- If the dataset is highly imbalanced, undersample the data while ensuring that all classes are represented, then tune the model.
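A simple way to act on the last two tips is a stratified subsample, which shrinks the data while preserving the class ratio. A minimal sketch on toy imbalanced data (the dataset here is synthetic, standing in for your own):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.9, 0.1],
                           random_state=0)  # toy imbalanced data

# stratify=y keeps the class ratio intact in the subset, so the
# minority class is not lost when downsizing for tuning
X_tune, _, y_tune, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=0)
print(X_tune.shape)  # (2000, 20)
```

Tune on (X_tune, y_tune), then refit the best configuration on the full dataset.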