In the previous post, I explored and analysed the Telco Churn dataset to gain a better understanding of it. This led me to think: why don't I create a one-stop repository of all the popular ML algorithms for identifying churn, or for supervised learning in general?
Alright, I admit it: I might have fallen into a rabbit hole of ✨machine learning✨, and maybe you didn't need such a repository in the first place. However, I believe it will be beneficial to me in the long run (and hopefully to you) because I can view all the popular ML algorithms and their key information on a single page and experiment rapidly.
The focus of this post will be on structured data (i.e. tables with a fixed schema), as seen in the image below. In the future, I will also extend this repository to cover other data formats like images, videos, and natural language, so stay tuned!
Preprocessing Steps
I have kept the preprocessing steps to a minimum, as the focus of this post is on the ML algorithms rather than on preprocessing best practice. However, if you are interested:
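As a rough sketch of the kind of minimal preprocessing involved, the snippet below scales numeric columns and one-hot encodes categorical ones on a tiny stand-in dataframe (the column names here are illustrative, not the exact Telco schema used in the repository):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in for the Telco Churn dataframe; column names are illustrative.
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8, 22],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 99.65, 70.70],
    "Contract": ["Month-to-month", "One year", "Month-to-month",
                 "One year", "Month-to-month", "Two year"],
    "Churn": [1, 0, 1, 0, 1, 0],
})

X, y = df.drop(columns="Churn"), df["Churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42)

# Scale numeric features, one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["tenure", "MonthlyCharges"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)  # fit on train only, to avoid leakage
```

Fitting the transformer on the training split only (and merely transforming the test split) avoids leaking test-set statistics into training.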
Supervised Learning Models for Structured Data
In my repository, I have added the following algorithms for experimentation:
Linear Models
Logistic Regression
Elastic Net
Support Vector Machine (SVM)
K Nearest Neighbours
Naive Bayes
Decision Tree
Ensemble Methods
Random Forest
AdaBoost
Gradient Boosting
Majority Voting Classifier
Weighted Classifier
Multi-Layer Perceptron
Model Selection
To speed up my experimentation process, I run this script to determine the best algorithm(s):
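A minimal sketch of such a selection script is shown below: it fits each candidate classifier, prints its recall on a held-out split, and tracks the best score. A synthetic dataset stands in for the preprocessed Telco data, so the printed numbers will differ from the output shown afterwards, and the exact model list and hyperparameters in the repository may vary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed churn data.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = [
    LogisticRegressionCV(max_iter=1000),
    SVC(),
    KNeighborsClassifier(),
    GaussianNB(),
    DecisionTreeClassifier(random_state=42),
    RandomForestClassifier(random_state=42),
    AdaBoostClassifier(random_state=42),
    GradientBoostingClassifier(random_state=42),
]

best_score, best_name = 0.0, None
for model in models:
    model.fit(X_train, y_train)
    score = recall_score(y_test, model.predict(X_test))
    name = type(model).__name__
    print(f"Accuracy (recall) of {name} is {round(score, 4)}")
    if score > best_score:
        best_score, best_name = score, name

print(f"Best score is {round(best_score, 4)} by {best_name}")
```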
Output:
Accuracy (recall) of LogisticRegressionCV is 0.7967
Accuracy (recall) of ElasticNetCV is 0.269
Accuracy (recall) of SVC is 0.7953
Accuracy (recall) of KNeighborsClassifier is 0.7818
Accuracy (recall) of GaussianNB is 0.7534
Accuracy (recall) of DecisionTreeClassifier is 0.7193
Accuracy (recall) of RandomForestClassifier is 0.7918
Accuracy (recall) of AdaBoostClassifier is 0.7974
Accuracy (recall) of GradientBoostingClassifier is 0.7612
Accuracy (recall) of VotingClassifier is 0.7989
Accuracy (recall) of VotingClassifier is 0.774
Accuracy (recall) of MLPClassifier is 0.7967
Best score is 0.7974 by AdaBoostClassifier
Bonus - Refactor Code for a Model Training Job
After selecting the best-performing model, you may want to submit a training job to AWS SageMaker or another cloud provider. In that case, you have to refactor your code from a Jupyter notebook (.ipynb) into a Python (.py) file. Here is sample code from the repository.
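As an illustration of what that refactor can look like (this is a sketch, not the repository's exact code; the file paths, argument names, and `churn.csv` filename are hypothetical), the notebook logic moves into a function plus a command-line entry point, with the data and model locations read from SageMaker-style environment variables:

```python
# train.py -- illustrative script-style training entry point.
import argparse
import os

import joblib
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split


def train(data_path: str, model_dir: str) -> float:
    """Fit the chosen model on a churn CSV and persist it; return test recall."""
    df = pd.read_csv(data_path)
    X, y = df.drop(columns="Churn"), df["Churn"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=42)

    model = AdaBoostClassifier(random_state=42)
    model.fit(X_train, y_train)
    score = recall_score(y_test, model.predict(X_test))

    os.makedirs(model_dir, exist_ok=True)
    joblib.dump(model, os.path.join(model_dir, "model.joblib"))
    return score


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # SageMaker injects these locations via environment variables.
    parser.add_argument("--data", default=os.environ.get("SM_CHANNEL_TRAINING", "."))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "."))
    args, _ = parser.parse_known_args()

    data_file = os.path.join(args.data, "churn.csv")  # hypothetical filename
    if os.path.exists(data_file):
        print(f"Recall: {train(data_file, args.model_dir):.4f}")
```

Keeping the training logic in a plain function with explicit paths also makes it testable locally before you ever submit a cloud job.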