T20 Cricket Win Prediction with Deep Learning

Date: April 2024

Objective

Our goal was to predict the outcome of T20 cricket matches using various deep learning techniques with a focus on optimizing accuracy and real-time prediction capabilities. This post summarizes our project journey through data collection, feature engineering, model selection, and deployment.

Architecture Overview

Data Collection and Building Pipeline

Data Collection

We began by gathering data from sources like Cricsheet and ESPN cricket stats. Necessary preprocessing steps included handling missing values, encoding categorical variables, and performing exploratory data analysis (EDA) to identify key patterns and relationships in the match data.
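As a rough illustration of the preprocessing steps described above, here is a minimal pandas sketch. The column names (`venue`, `toss_winner`, `first_innings_runs`) and the fill strategies (median for numeric gaps, an "unknown" label for categorical gaps) are assumptions made for the example, not the exact choices used in the project.

```python
import pandas as pd


def preprocess_matches(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values, then one-hot encode the non-numeric columns."""
    df = df.copy()
    # Numeric gaps: fill with the column median (assumed strategy).
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    # Categorical gaps: use an explicit "unknown" label (assumed strategy).
    categorical_cols = df.select_dtypes(exclude="number").columns
    df[categorical_cols] = df[categorical_cols].fillna("unknown")
    # One-hot encode every categorical column.
    return pd.get_dummies(df, columns=list(categorical_cols))


# Toy rows with hypothetical column names.
matches = pd.DataFrame({
    "venue": ["Mumbai", None, "Chennai"],
    "toss_winner": ["A", "B", "A"],
    "first_innings_runs": [165.0, None, 180.0],
})
features = preprocess_matches(matches)
```

One-hot encoding keeps the example simple; for high-cardinality fields such as player names, a learned embedding may be a better fit.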

Technologies Used:

Data Processing: Apache Airflow, Apache Spark
Storage: Hadoop HDFS

Building the Pipeline

Utilizing Apache Airflow, we orchestrated the data pipeline to ensure efficient data flow and management. The pipeline includes data extraction, transformation, and loading (ETL) processes, enabling seamless integration of diverse data sources.

Key Components:

ETL Orchestration: Apache Airflow
Data Transformation: Apache Spark
Storage Management: Hadoop HDFS
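The three ETL stages can be sketched as plain Python functions. This is only an illustration of the shape of the pipeline: in the setup described above, each function would run as an Airflow task, the transformation would run on Spark, and the sink would be HDFS. Here pandas and in-memory buffers stand in for all three, and the columns in `RAW_CSV` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for a downloaded source file.
RAW_CSV = """match_id,over,runs,wickets
1,1,7,0
1,2,11,1
2,1,4,0
"""


def extract(raw: str) -> pd.DataFrame:
    """Extract: parse the raw per-over records."""
    return pd.read_csv(io.StringIO(raw))


def transform(overs: pd.DataFrame) -> pd.DataFrame:
    """Transform: aggregate per-over records into one row per match."""
    return overs.groupby("match_id", as_index=False).agg(
        total_runs=("runs", "sum"),
        total_wickets=("wickets", "sum"),
        overs_played=("over", "max"),
    )


def load(summary: pd.DataFrame, sink: io.StringIO) -> int:
    """Load: write the summary to the sink and return the row count."""
    summary.to_csv(sink, index=False)
    return len(summary)


sink = io.StringIO()
summary = transform(extract(RAW_CSV))
rows_written = load(summary, sink)
```

Keeping each stage a separate function with explicit inputs and outputs makes it straightforward to wrap them as individual tasks in an orchestrator later.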

Data Augmentation

To diversify the training set and improve generalization, we augmented the data by truncating each match to a given number of overs. This lets the model produce win predictions at any stage of a match, not only during the final overs.

Further augmentation, such as adding random noise or using GANs, is a possible next step for improving model performance.
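One way to picture this over-based augmentation: for each match, build one training row per cutoff over, using only what had happened up to that point. The sketch below assumes a per-over table with hypothetical columns `match_id`, `over`, `runs`, `wickets`, and `batting_team_won`; the real feature set is likely richer.

```python
import pandas as pd


def snapshots_by_over(overs: pd.DataFrame, cutoffs: list[int]) -> pd.DataFrame:
    """Build one training row per (match, cutoff) from the overs seen so far."""
    rows = []
    for match_id, match in overs.groupby("match_id"):
        label = int(match["batting_team_won"].iloc[0])
        last_over = int(match["over"].max())
        for cutoff in cutoffs:
            seen = match[match["over"] <= cutoff]
            # Skip cutoffs outside the match to avoid empty or duplicate rows.
            if seen.empty or cutoff > last_over:
                continue
            rows.append({
                "match_id": match_id,
                "overs_completed": cutoff,
                "runs_so_far": int(seen["runs"].sum()),
                "wickets_so_far": int(seen["wickets"].sum()),
                "batting_team_won": label,
            })
    return pd.DataFrame(rows)


# Toy data: a single four-over match.
overs = pd.DataFrame({
    "match_id": [1, 1, 1, 1],
    "over": [1, 2, 3, 4],
    "runs": [6, 10, 4, 12],
    "wickets": [0, 1, 0, 1],
    "batting_team_won": [1, 1, 1, 1],
})
samples = snapshots_by_over(overs, cutoffs=[2, 4, 6])
```

Every snapshot of a match shares the same label, so splitting train and test sets by `match_id` rather than by row helps avoid leakage between them.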

Modeling with Weights & Biases (WandB)

To enhance model tracking and experiment management, we integrated Weights & Biases (WandB) into our workflow. This facilitated real-time monitoring of training processes and streamlined collaboration.

Technologies Used:

Experiment Tracking: WandB
Machine Learning Framework: PyTorch
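To show where metric logging fits into a PyTorch training loop, here is a minimal sketch. It takes a generic `log_fn` callback so that it runs without a WandB account; the tiny model, the synthetic data, and the metric names are all assumptions made for the example and are not the project's actual setup.

```python
import torch
from torch import nn


def train(model, features, labels, epochs=5, lr=1e-3, log_fn=print):
    """Full-batch training loop that reports metrics once per epoch."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    history = []
    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()
        logits = model(features).squeeze(1)
        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()
        # A logit above 0 corresponds to a predicted win probability above 0.5.
        accuracy = ((logits > 0).float() == labels).float().mean().item()
        metrics = {"epoch": epoch, "loss": loss.item(), "accuracy": accuracy}
        log_fn(metrics)  # with WandB this would be wandb.log
        history.append(metrics)
    return history


torch.manual_seed(0)
X = torch.randn(64, 4)        # synthetic features
y = (X[:, 0] > 0).float()     # synthetic win/loss labels
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
history = train(model, X, y, log_fn=lambda metrics: None)
```

With the `wandb` package installed and configured, calling `wandb.init(project="t20-win-prediction")` (a hypothetical project name) and passing `log_fn=wandb.log` would stream the same dictionaries to the WandB dashboard.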

Hyperparameter Tuning

We conducted hyperparameter tuning to optimize model performance. This involved adjusting parameters such as learning rate, batch size, and network architecture to achieve the best possible results.

Technologies Used:

Hyperparameter Optimization: WandB Sweeps
Model Training: PyTorch

Detailed Report

To achieve optimal model performance, we utilized WandB Sweeps to systematically explore the hyperparameter space. The tuning process focused on four main aspects:

Learning Rate:
- Experimented with a range of learning rates from 0.0001 to 0.1.
- Identified 0.001 as the optimal learning rate that balances convergence speed and stability.
Batch Size:
- Tested batch sizes of 32, 64, and 128.
- Found that a batch size of 32 provided the best trade-off between training time and model accuracy.
Network Architecture:
- Modified the number of layers and units per layer.
- Added a hidden layer with 128 neurons to increase the model's capacity without overfitting.
- Hidden Size: Increased to 256 neurons to further improve model capacity.
- Number of Layers: Configured the model with 3 layers to balance complexity and computational efficiency.
Dropout Rate:
- Experimented with dropout rates ranging from 0.65 to 0.7.
- Settled on a dropout rate of 0.7 to effectively prevent overfitting while maintaining model performance.
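The search space above could be written as a WandB Sweeps configuration along the following lines. The ranges for learning rate, batch size, and dropout come from this post; the search method, the metric name, and the candidate values for `hidden_size` and `num_layers` are assumptions, since the post only reports the values that were finally chosen.

```python
sweep_config = {
    "method": "bayes",  # assumed; the search strategy is not stated in the post
    "metric": {"name": "val_accuracy", "goal": "maximize"},  # hypothetical metric name
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-4,
            "max": 1e-1,
        },
        "batch_size": {"values": [32, 64, 128]},
        "hidden_size": {"values": [128, 256]},  # assumed candidates
        "num_layers": {"values": [2, 3]},       # assumed candidates
        "dropout": {"distribution": "uniform", "min": 0.65, "max": 0.7},
    },
}

# The configuration reported as best in this post.
best_config = {
    "learning_rate": 1e-3,
    "batch_size": 32,
    "hidden_size": 256,
    "num_layers": 3,
    "dropout": 0.7,
}


def in_search_space(config: dict, parameters: dict) -> bool:
    """Check that every value in config lies inside the declared search space."""
    for name, value in config.items():
        spec = parameters[name]
        if "values" in spec:
            if value not in spec["values"]:
                return False
        elif not spec["min"] <= value <= spec["max"]:
            return False
    return True


best_is_covered = in_search_space(best_config, sweep_config["parameters"])
```

With the `wandb` package installed, a dictionary like this would typically be registered with `wandb.sweep(sweep_config, project=...)`, and the returned sweep ID passed to `wandb.agent` together with a training function.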

Hyperparameter tuning raised model accuracy from 80% to 85%, which shows the value of a systematic search over ad hoc adjustment.

Results

The final model achieved an accuracy of 85% on the test set. We also evaluated it at different stages of the match and found that accuracy increases as the overs progress. This is expected, since more of the match has been observed by the time the prediction is made.
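A per-over breakdown of this kind can be computed with a simple group-by. The sketch below uses toy predictions and hypothetical column names (`overs_completed`, `win_probability`, `batting_team_won`); the numbers are illustrative only and are not results from the project.

```python
import pandas as pd


def accuracy_by_over(predictions: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Fraction of correct win/loss calls, grouped by overs completed."""
    predicted = (predictions["win_probability"] >= threshold).astype(int)
    correct = predicted == predictions["batting_team_won"]
    return correct.groupby(predictions["overs_completed"]).mean()


# Toy predictions, for illustration only.
predictions = pd.DataFrame({
    "overs_completed": [5, 5, 12, 12, 19, 19],
    "win_probability": [0.60, 0.55, 0.70, 0.40, 0.90, 0.10],
    "batting_team_won": [1, 0, 1, 1, 1, 0],
})
per_over = accuracy_by_over(predictions)
```

Plotting `per_over` against the over number is a quick way to see how much of the overall accuracy comes from late-match predictions.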

[Figure: win example]

[Figure: loss example]

Conclusion

By working systematically through data preprocessing, feature engineering, class balancing, error analysis, and advanced modeling techniques, we built a model for predicting T20 cricket match outcomes. The final model achieved an accuracy of 85% on the test set and can produce predictions at any stage of a match. Integrating it into a cricket application could improve user engagement by providing analytics and live win probabilities.

Overall Technologies Used:

Language: Python
Core Libraries: Pandas, Polars, Apache Spark, NumPy, Scikit-learn, PyTorch
Visualization: Matplotlib, Seaborn
Workflow Management: Apache Airflow
Experiment Tracking: Weights & Biases (WandB)
Deployment: Docker (containerized for future deployment)

This structured approach allowed us to effectively predict the outcomes of T20 cricket matches, ensuring our model was both accurate and reliable.