A comprehensive data analysis and machine learning project to predict water potability using various water quality metrics.
AquaSense analyzes water quality data to determine potability using multiple machine learning models. The project includes extensive data visualization, preprocessing, and comparative analysis of different classification algorithms.
- Comprehensive exploratory data analysis (EDA)
- Interactive visualizations using Plotly and Seaborn
- Missing data handling and preprocessing
- Implementation of 7 different machine learning models:
  - Logistic Regression
  - Decision Tree Classifier
  - Random Forest Classifier
  - XGBoost Classifier
  - K-Nearest Neighbors
  - Support Vector Machine
  - AdaBoost Classifier
- Model performance comparison and evaluation
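
The model comparison listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it uses synthetic data from `make_classification` in place of the water dataset, default hyperparameters, and an assumed 75/25 train/test split. XGBoost is included only if the package is installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the nine water quality features (assumption).
X, y = make_classification(
    n_samples=600, n_features=9, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale-sensitive models are wrapped in a pipeline with StandardScaler.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "K-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# XGBoost is optional here so the sketch still runs without it.
try:
    from xgboost import XGBClassifier

    models["XGBoost"] = XGBClassifier(n_estimators=100, random_state=0)
except ImportError:
    pass

# Fit every model on the same split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22} {score:.3f}")
```

Using one shared split keeps the accuracy figures comparable across models; the numbers printed here describe the synthetic data only.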
The project uses a water potability dataset with the following features:
- pH value
- Hardness
- Solids
- Chloramines
- Sulfate
- Conductivity
- Organic carbon
- Trihalomethanes
- Turbidity
- Potability (target variable)
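
A minimal sketch of handling missing values in a table with these columns is shown below. The column names, the choice of median imputation, and the in-memory demo data are all assumptions for illustration; the notebook may name columns differently or use another strategy.

```python
import numpy as np
import pandas as pd

# Assumed column names, modelled on the feature list above.
FEATURES = [
    "ph", "Hardness", "Solids", "Chloramines", "Sulfate",
    "Conductivity", "Organic_carbon", "Trihalomethanes", "Turbidity",
]
TARGET = "Potability"


def make_demo_frame(n_rows=200, seed=0):
    """Build a synthetic frame with gaps in a few columns (not real data)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(rng.normal(size=(n_rows, len(FEATURES))), columns=FEATURES)
    df[TARGET] = rng.integers(0, 2, size=n_rows)
    for col in ("ph", "Sulfate", "Trihalomethanes"):
        mask = rng.random(n_rows) < 0.1  # drop roughly 10% of the values
        df.loc[mask, col] = np.nan
    return df


def impute_with_median(df):
    """Return a copy where each feature's gaps are filled with its median."""
    out = df.copy()
    out[FEATURES] = out[FEATURES].fillna(out[FEATURES].median())
    return out


df = make_demo_frame()
missing_before = df.isna().sum()
clean = impute_with_median(df)

print(missing_before[missing_before > 0])
print("Remaining missing values:", int(clean.isna().sum().sum()))
```

With a real CSV, `make_demo_frame()` would be replaced by a `pd.read_csv(...)` call on the dataset file.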
- Python: Core programming language
- Data Processing: Pandas, NumPy
- Visualization:
  - Matplotlib
  - Seaborn
  - Plotly Express
- Machine Learning:
  - Scikit-learn
  - XGBoost
- Development Environment: Jupyter Notebook
- Correlation heatmaps
- Distribution plots
- Box plots
- Violin plots
- Pair plots
- Interactive Plotly visualizations
- Missing data analysis
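
Two of the views above, the missing data analysis and the correlation heatmap, can be sketched as follows. This example is an illustration on synthetic data with an assumed subset of columns. It draws the heatmap with plain Matplotlib so it has no Seaborn dependency; `seaborn.heatmap` would be the closer match to the project's stack.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic demo data; column names are assumptions.
rng = np.random.default_rng(42)
demo = pd.DataFrame(
    rng.normal(size=(300, 4)),
    columns=["ph", "Hardness", "Solids", "Turbidity"],
)
demo.loc[rng.random(300) < 0.15, "ph"] = np.nan

# Share of missing values per column, largest first.
missing_share = demo.isna().mean().sort_values(ascending=False)

# Pairwise Pearson correlation; pandas skips missing values pair by pair.
corr = demo.corr()

fig, ax = plt.subplots(figsize=(5, 4))
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax, label="Pearson correlation")
ax.set_title("Feature correlation (synthetic demo data)")
fig.tight_layout()

print(missing_share)
```

In a notebook, the `matplotlib.use("Agg")` line can be dropped so the figure renders inline.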
| Model | Accuracy score computed |
|---|---|
| Logistic Regression | ✓ |
| Decision Tree | ✓ |
| Random Forest | ✓ |
| XGBoost | ✓ |
| K-Nearest Neighbors | ✓ |
| SVM | ✓ |
| AdaBoost | ✓ |
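
As a rough illustration of how one row's accuracy score could be computed, and how it relates to other common classification metrics, consider the sketch below. The data is synthetic and the single logistic regression model is only a stand-in; none of the values it prints are results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced binary problem (assumption, not project data).
X, y = make_classification(
    n_samples=500, n_features=9, weights=[0.6, 0.4], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Accuracy alone can hide class imbalance, so report it next to the others.
report = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
}
matrix = confusion_matrix(y_test, y_pred)  # rows: true class, cols: predicted

for metric, value in report.items():
    print(f"{metric:<10} {value:.3f}")
print(matrix)
```

Accuracy equals the sum of the confusion matrix diagonal divided by the total number of test samples, which is a convenient sanity check.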
- Clone the repository:

      git clone https://github.com/yourusername/aquasense.git
      cd aquasense

- Install required packages:

      pip install -r requirements.txt

- Run Jupyter Notebook:

      jupyter notebook

- Open `AquaSense.ipynb` to view the analysis
- Python 3.x
- Jupyter Notebook
- Required Python packages:
  - pandas
  - numpy
  - matplotlib
  - seaborn
  - plotly
  - scikit-learn
  - xgboost
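
One way to confirm that the packages listed above are installed before opening the notebook is sketched here. The helper name `installed_versions` is hypothetical, and the package names are taken as their PyPI distribution names.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they would appear on PyPI (assumption).
REQUIRED = [
    "pandas", "numpy", "matplotlib", "seaborn",
    "plotly", "scikit-learn", "xgboost",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


status = installed_versions(REQUIRED)
for name, ver in status.items():
    print(f"{name:<14} {ver or 'MISSING'}")
```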
- Comprehensive analysis of water quality parameters
- Identification of key factors affecting water potability
- Comparative analysis of different machine learning approaches
- Model performance evaluation using various metrics
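
Identifying influential factors is often done with tree-based feature importances. The sketch below shows the idea on synthetic data where one column deliberately drives the label. The column names are placeholders, and the ranking says nothing about which water quality parameters matter in the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only "signal" carries information about the label.
rng = np.random.default_rng(7)
names = ["signal", "noise_a", "noise_b", "noise_c"]
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=names)
y = (X["signal"] + 0.2 * rng.normal(size=500) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=7).fit(X, y)

# Impurity-based importances sum to 1; higher means more influence on splits.
importances = pd.Series(forest.feature_importances_, index=names).sort_values(
    ascending=False
)
print(importances)
```

Impurity-based importances can favour high-cardinality features, so `sklearn.inspection.permutation_importance` is a reasonable cross-check on real data.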
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
- Original dataset contributors
- Soumya Kushwaha - Project Author
- GitHub Repository
For any queries or suggestions, please reach out through GitHub issues.
Developed with ❤️ by Soumya Kushwaha