Date: July 2024
GitHub: BinaryClassification
Our goal was to predict potential customers who are more likely to churn using various machine learning techniques with a focus on optimizing recall and precision.
This post summarizes our journey through the stages of data exploration, feature engineering, model selection, and refinement.
We began by understanding the dataset’s structure and performing necessary preprocessing. This included handling missing values, encoding categorical variables, and performing initial exploratory data analysis (EDA) to identify patterns and relationships in the data.
Power BI Report
Key Insights:
Technologies Used:
To simplify the model, we dropped features with low Matthews correlation coefficient and used Variance Inflation Factor (VIF) to remove multicollinear features. We then implemented simple models using class weights to establish a baseline.
Given the imbalanced nature of the churn data, we applied Synthetic Minority Over-sampling Technique for Nominal data (SMOTE-N) to balance the classes. This improved the model’s ability to predict churners effectively.
Key Insights:
To improve our model’s decision boundaries, we analyzed errors and introduced decision trees. This helped overcome the limitations of linear decision boundaries and led to better predictions.
Key Takeaways:
We implemented ensemble methods such as Random Forest and Gradient Boosting to enhance model performance. AutoML tools were also utilized to streamline model selection and hyperparameter tuning. Our final model, CatBoost, achieved excellent recall and precision.
Models Used:
Key Insights:
By systematically progressing through data preprocessing, feature selection, class balancing, error analysis, and advanced techniques, we successfully built a robust model for predicting customer churn. The final model, CatBoost, achieved an impressive recall and precision of 86%, proving highly effective at identifying churners.