This project investigates whether the winner of a cricket match can be predicted during the second innings. By applying several machine learning methods together with shrinkage strategies, it aims to provide accurate predictions that captains, coaches, and broadcasting teams can use to support strategic decisions during live matches. This repository only contains the code for the web application.
Real-time win predictions keep a cricket broadcast engaging for viewers. This inspired us to create a model that predicts the winner as the second innings progresses, helping viewers follow the evolving match dynamics. For team management, the tool can offer insights that improve on-field decision-making. The public availability of ball-by-ball datasets further motivated us to use this data to develop and compare different machine learning models.
The project aims to achieve the following:
- Identify key factors that influence the outcome of a match during the second innings.
- Develop machine learning models based on these predictors.
- Deploy the models in a web-based dashboard to predict match outcomes in real time.
The dataset, sourced from Kaggle (Jamie Welsh), includes ball-by-ball data from T20 cricket matches between 2005 and 2023. The dataset contains 425,119 records with 34 attributes describing each delivery. After filtering for second-innings data, 200,304 observations were used in the analysis.
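The second-innings filter described above can be sketched in Pandas as follows. The column names (`match_id`, `innings`, `runs_off_bat`) and the toy rows are assumptions made for illustration; the actual dataset may label these fields differently.

```python
import pandas as pd

# Toy ball-by-ball frame; column names are assumed, not taken from the dataset.
balls = pd.DataFrame({
    "match_id": [1, 1, 1, 1, 2, 2],
    "innings": [1, 1, 2, 2, 1, 2],
    "runs_off_bat": [4, 0, 1, 6, 2, 0],
})

# Keep only deliveries bowled while a team is chasing.
second_innings = balls[balls["innings"] == 2].reset_index(drop=True)

print(len(balls), "deliveries in total,", len(second_innings), "in the second innings")
```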
The prediction task is modeled as a classification problem, where the target variable (Chased Successfully) is binary:
- 0: Chasing team lost.
- 1: Chasing team won.
Thirteen key predictors were chosen based on their practical relevance to the match outcome:
- Runs Required
- Balls Remaining
- Current Score
- Balls Delivered
- Wickets Remaining
- Target Score
- Current Run Rate (CRR)
- Required Run Rate (RRR)
- Striker's Score
- Balls Faced by Striker
- Non-Striker's Score
- Balls Faced by Non-Striker
- Runs Conceded by Bowler
Derived predictors such as Balls Delivered, Wickets Remaining, CRR, and RRR were computed from the raw ball-by-ball fields and included in the model. Missing values in CRR were imputed with the column mean, and missing values in RRR with the column maximum.
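As a rough illustration, the derived predictors and the imputation step might look like the sketch below. It assumes the usual cricket definitions (run rates expressed in runs per over, a full 120-ball T20 innings, ten wickets per side) and hypothetical column names. It also assumes that the missing values come from division by zero at the first and last ball, which the text above does not state.

```python
import numpy as np
import pandas as pd

TOTAL_BALLS = 120  # assumption: a full T20 innings of 20 six-ball overs

# Toy match states; column names are illustrative only.
state = pd.DataFrame({
    "target_score": [160, 160, 160, 160],
    "current_score": [0, 45, 120, 158],
    "balls_remaining": [120, 84, 30, 0],
    "wickets_fallen": [0, 1, 4, 7],
})

state["runs_required"] = state["target_score"] - state["current_score"]
state["balls_delivered"] = TOTAL_BALLS - state["balls_remaining"]
state["wickets_remaining"] = 10 - state["wickets_fallen"]

# Run rates in runs per over. Division by zero yields NaN or inf,
# so infinities are normalized to NaN before imputing.
crr = 6 * state["current_score"] / state["balls_delivered"]
rrr = 6 * state["runs_required"] / state["balls_remaining"]
state["crr"] = crr.replace([np.inf, -np.inf], np.nan)
state["rrr"] = rrr.replace([np.inf, -np.inf], np.nan)

# Impute as described: mean for CRR, maximum for RRR.
state["crr"] = state["crr"].fillna(state["crr"].mean())
state["rrr"] = state["rrr"].fillna(state["rrr"].max())

print(state[["balls_delivered", "wickets_remaining", "crr", "rrr"]])
```

Filling RRR with the maximum treats a chase that has run out of balls as the hardest case seen in the data, whereas the mean would make it look like an average chase.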
The following Python libraries were used:
- NumPy: For numerical operations.
- Pandas: For data manipulation.
- Scikit-learn: For machine learning model development.
- Streamlit: For building the web-based dashboard.
The table below summarizes the performance of each classifier based on accuracy when applied to the test dataset:
| Classifier | Accuracy (%) |
|---|---|
| Logistic Regression | 81.89 |
| Penalized Logistic Regression (LASSO) | 81.89 |
| Shrinkage Estimation | 81.94 |
| Positive Shrinkage Estimation | 81.94 |
| Linear Shrinkage Estimation | 81.89 |
| Pretest Estimation | 81.96 |
| Shrinkage Pretest Estimation | 81.96 |
| Gradient Boosting Machine | 91.66 |
The Gradient Boosting Machine performed best, with a test accuracy of 91.66%. The other classifiers all fall within a narrow band of 81.89% to 81.96%, nearly 10 percentage points lower.
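A comparison of this kind can be set up in Scikit-learn along the lines of the sketch below. It runs on synthetic data from `make_classification`, so its accuracies say nothing about the figures in the table. The hyperparameters are library defaults, not the project's settings, and the shrinkage and pretest estimators are left out because Scikit-learn does not provide them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 13 predictors and the binary target.
X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    # scikit-learn's default logistic regression applies L2 regularization.
    "Logistic Regression": LogisticRegression(max_iter=1000),
    # L1 penalty gives LASSO-style coefficient shrinkage.
    "Penalized Logistic Regression (LASSO)": LogisticRegression(
        penalty="l1", solver="liblinear"
    ),
    "Gradient Boosting Machine": GradientBoostingClassifier(random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {100 * accuracies[name]:.2f}%")
```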
The project was built with the following technologies:
- Python: Programming language used for model development.
- NumPy: Numerical computation.
- Pandas: Data manipulation and preprocessing.
- Scikit-learn: Machine learning library.
- Streamlit: Framework for building the web-based dashboard.
The Streamlit app is available in the Web Application folder, and the model notebook can be found in the Model folder. You can access the app by visiting https://match-predictor.onrender.com/.
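For orientation, a dashboard page of this kind could be structured roughly as follows. This is an illustrative sketch, not the code in the Web Application folder: `rate_features` and `render_dashboard` are hypothetical names, only a few of the thirteen inputs are collected, and the model-loading step is left out because the model file's name and format are not given here.

```python
def rate_features(target_score, current_score, balls_delivered, total_balls=120):
    """Return (runs_required, balls_remaining, crr, rrr) for a live chase.

    Run rates are in runs per over; a rate is None when it is undefined.
    """
    runs_required = target_score - current_score
    balls_remaining = total_balls - balls_delivered
    crr = 6 * current_score / balls_delivered if balls_delivered else None
    rrr = 6 * runs_required / balls_remaining if balls_remaining else None
    return runs_required, balls_remaining, crr, rrr


def render_dashboard():
    # Imported lazily so rate_features stays usable without Streamlit installed.
    import streamlit as st

    st.title("Second-innings win predictor (sketch)")
    target = st.number_input("Target score", min_value=1, value=160, step=1)
    score = st.number_input("Current score", min_value=0, value=45, step=1)
    balls = st.number_input(
        "Balls delivered", min_value=0, max_value=120, value=36, step=1
    )

    runs_required, balls_remaining, crr, rrr = rate_features(target, score, balls)
    left, right = st.columns(2)
    left.metric("Current run rate", "n/a" if crr is None else f"{crr:.2f}")
    right.metric("Required run rate", "n/a" if rrr is None else f"{rrr:.2f}")
    st.write(f"{runs_required} runs needed from {balls_remaining} balls")
    # A fitted classifier would be loaded here and its predicted win
    # probability displayed; that step is intentionally omitted.
```

Saving this as a script (for example `app.py`, an illustrative name), adding a final `render_dashboard()` call, and running `streamlit run app.py` would serve the page locally.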