
Interpretable Machine Learning Identifies Data-Driven Critical Thresholds for Water Potability Assessment (2023)

Artificial Intelligence, Data Science, Environmental Science, Machine Learning

Author:
Mohammad Motaghianfar


Abstract

Access to clean drinking water is a major global health issue. Traditional assessment relies on fixed regulatory standards, while many machine learning tools act as “black boxes,” offering predictions without clear reasons. This study applies an interpretable machine learning approach, using eXtreme Gradient Boosting (XGBoost) on a water quality dataset (3,276 samples, 9 features). We applied SHapley Additive exPlanations (SHAP) to identify the feature values at which the model’s prediction shifts toward non-potable. Missing values were imputed using K-nearest neighbors, and hyperparameters were tuned with Bayesian optimization. Sulfate (SHAP importance 0.642), pH (0.554), and Hardness (0.452) were the most influential features. The key thresholds were Sulfate at 185.1 mg/L (below WHO’s 250 mg/L), pH at 3.0 (well below the 6.5–8.5 range), and Hardness at 96.4 mg/L (a new finding with no WHO standard). The model achieved 62.0% accuracy and an AUC-ROC of 61.5%, indicating that potability is difficult to predict from these features alone. Even so, the approach provides clear, data-driven guidance for water safety monitoring and could help refine regulations. It offers a model for transparent AI in environmental health.

Keywords: interpretable machine learning, water quality, SHAP analysis, critical thresholds, water potability, XGBoost


1. Introduction

Waterborne illnesses affect millions yearly, making reliable water quality checks vital. The World Health Organization sets guidelines based on lab studies, but real-world data could improve these standards. Machine learning can predict water safety, but most models don’t explain why water is unsafe or at what levels problems arise.

Our study uses interpretable machine learning to find specific thresholds at which water quality turns risky. We hypothesize that SHAP analysis combined with ensemble models can reveal data-driven limits to guide water management. Our goal is to show how transparent AI can offer practical insights for safer drinking water.


2. Methods

Data Collection

We used the Water Potability dataset from Kaggle, with 3,276 samples measuring nine features: pH, Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic_carbon, Trihalomethanes, and Turbidity. The target is whether water is potable (1) or not (0).

Data Preprocessing

Missing values (pH: 15.0%, Sulfate: 23.8%, Trihalomethanes: 4.9%) were imputed using K-nearest neighbors (k=5) so that relationships between features were preserved. Features were standardized (mean=0, variance=1). The dataset was split 80:20 into training and test sets, stratified to keep the class balance (39.0% potable) in both.
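The imputation step can be illustrated with a minimal, standard-library sketch of K-nearest-neighbors imputation. It is not the study's code: the function names, the NaN-aware distance, and the toy rows are assumptions made for illustration, and a real pipeline would more likely use a library imputer.

```python
import math

def nan_euclidean(a, b):
    """Euclidean distance over coordinates observed in both rows, rescaled
    by (total coordinates / shared coordinates); inf if nothing is shared."""
    shared = [(x - y) ** 2 for x, y in zip(a, b)
              if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(len(a) / len(shared) * sum(shared))

def knn_impute(rows, k=5):
    """Return a copy of rows in which each None is replaced by the mean of
    that feature over the k nearest rows that have the feature observed."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: every other row with feature j observed.
            donors = sorted(
                (nan_euclidean(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:  # leave the gap if no usable neighbor exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Toy rows of [pH, Sulfate]; the values are invented for illustration only.
toy = [
    [7.0, 330.0],
    [7.2, 340.0],
    [6.9, None],   # Sulfate missing
    [3.0, 180.0],
    [3.2, 190.0],
]
imputed = knn_impute(toy, k=2)
print(imputed[2])
```

Distances here are computed on the original rows, so one imputed value never feeds into another. The study used k=5 on all nine features; k=2 is used above only because the toy table has five rows.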

Model Development

We chose XGBoost for its ability to model complex, non-linear patterns. We tuned hyperparameters including the number of trees (50–300), maximum tree depth (3–10), and learning rate (0.01–0.3) using Bayesian optimization with 5-fold cross-validation, optimizing F1-score because the classes are imbalanced.
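As a rough sketch of the quantity being tuned, the snippet below computes a mean F1-score over k folds for one candidate configuration. This is the score a Bayesian optimizer would try to maximize; the optimizer and XGBoost themselves are not shown. The helper names, the `fit_predict` callable, and the mapping of the text's ranges onto `n_estimators`, `max_depth`, and `learning_rate` are assumptions, not details taken from the study.

```python
# Search ranges quoted in the text; the key names follow XGBoost's
# scikit-learn-style parameters and are an assumed mapping.
SEARCH_SPACE = {
    "n_estimators": (50, 300),
    "max_depth": (3, 10),
    "learning_rate": (0.01, 0.3),
}

def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def kfold_indices(n, k=5):
    """Yield (train_idx, val_idx) for k contiguous folds of near-equal size.
    No shuffling or stratification is done here; shuffle the data first."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

def cv_f1(fit_predict, X, y, k=5):
    """Mean validation F1 over k folds.
    fit_predict(X_train, y_train, X_val) must return predicted labels."""
    scores = []
    for train, val in kfold_indices(len(X), k):
        preds = fit_predict([X[i] for i in train],
                            [y[i] for i in train],
                            [X[i] for i in val])
        scores.append(f1_score([y[i] for i in val], preds))
    return sum(scores) / len(scores)

# Hypothetical stand-in model: always predicts "potable" (1).
def always_one(X_train, y_train, X_val):
    return [1] * len(X_val)

X_demo = [[float(i)] for i in range(10)]
y_demo = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(cv_f1(always_one, X_demo, y_demo, k=5))
```

In a real run, `fit_predict` would train a model with one sampled configuration from `SEARCH_SPACE`, and the optimizer would use the returned score to choose the next configuration.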

Interpretation Framework

SHAP analysis ranked features by their contribution to the model’s predictions and identified the feature values at which that contribution shifts the prediction toward non-potable. We read these critical values from SHAP dependence plots.
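The text does not say how thresholds were read off the dependence plots. One plausible procedure, sketched below as an assumption, is to sort samples by feature value, smooth the SHAP values, and take the point where the smoothed curve first crosses zero. The function names and synthetic data are hypothetical, and mean absolute SHAP value is shown as one common way to summarize importance.

```python
def mean_abs_importance(shap_values):
    """Global importance as the mean absolute SHAP value of one feature."""
    return sum(abs(v) for v in shap_values) / len(shap_values)

def moving_average(values, window):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def sign_change_threshold(feature_values, shap_values, window=5):
    """Feature value at which the smoothed SHAP curve first crosses zero,
    found by linear interpolation; None if the curve never changes sign."""
    pairs = sorted(zip(feature_values, shap_values))
    xs = [x for x, _ in pairs]
    smooth = moving_average([s for _, s in pairs], window)
    for i in range(len(xs) - 1):
        s0, s1 = smooth[i], smooth[i + 1]
        if s0 == 0:
            return xs[i]
        if s0 * s1 < 0:
            # Interpolate between the two points that bracket zero.
            return xs[i] + (xs[i + 1] - xs[i]) * (-s0) / (s1 - s0)
    if smooth and smooth[-1] == 0:
        return xs[-1]
    return None

# Synthetic dependence data (not from the study): the SHAP value falls
# linearly with the feature and changes sign at a feature value of 45.
demo_x = [float(v) for v in range(0, 100, 10)]
demo_shap = [(45.0 - x) / 100.0 for x in demo_x]

print(sign_change_threshold(demo_x, demo_shap))
print(mean_abs_importance(demo_shap))
```

With real SHAP output the curve is noisy and may cross zero several times, so the window size and the choice of which crossing to report are judgment calls that should be stated alongside any threshold.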

Evaluation Metrics

We measured accuracy, AUC-ROC, precision, recall, and F1-score. Thresholds were checked against SHAP value patterns and compared to WHO guidelines.
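For reference, the listed metrics can be computed from labels and scores as in the sketch below. It uses only the standard library, with AUC-ROC taken as the fraction of positive-negative pairs that the scores rank correctly (ties counted as half). The function names and toy labels are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for the chosen positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for the chosen positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def auc_roc(y_true, scores, positive=1):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one sample from each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores, invented for illustration.
y_true = [1, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1]
y_score = [0.9, 0.2, 0.4, 0.3, 0.6, 0.8]

print(classification_metrics(y_true, y_pred))
print(classification_metrics(y_true, y_pred, positive=0))  # non-potable class
print(auc_roc(y_true, y_score))
```

Calling `classification_metrics` once per class gives the kind of per-class precision figures quoted for potable and non-potable water.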


3. Results

The XGBoost model reached 62.0% accuracy and an AUC-ROC of 61.5% on the test set (656 samples). Its precision was higher for non-potable water (66%) than for potable water (52%), likely because of class imbalance.
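A short arithmetic check puts these figures in context. It reproduces the test-set size from the 80:20 split and compares the reported accuracy with the accuracy of always predicting the majority class. Rounding the test size up is an assumption (it matches how scikit-learn's train_test_split sizes the test set), and the baseline assumes the test set keeps the 39.0% potable share.

```python
import math

n_samples = 3276          # dataset size from the text
test_fraction = 0.20      # 80:20 split
potable_share = 0.390     # class balance from the text
reported_accuracy = 0.620

# Test-set size, rounding up (assumed convention).
n_test = math.ceil(n_samples * test_fraction)

# Accuracy of a trivial model that always predicts "non-potable".
majority_baseline = round(1.0 - potable_share, 3)

# Margin of the reported accuracy over that baseline, in percentage points.
margin_points = round((reported_accuracy - majority_baseline) * 100, 1)

print(n_test, majority_baseline, margin_points)
```

Under these assumptions the model sits about one percentage point above the majority-class baseline, which is why the AUC-ROC and per-class precision are more informative than accuracy alone.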

SHAP analysis highlighted Sulfate (0.642), pH (0.554), and Hardness (0.452) as the key predictors. The critical thresholds were Sulfate at 185.1 mg/L, pH at 3.0, and Hardness at 96.4 mg/L. The Sulfate threshold is 64.9 mg/L below WHO’s 250 mg/L guideline, the pH threshold lies well below the safe range (6.5–8.5), and the Hardness threshold is a new finding with no WHO standard for comparison.
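The comparison with the guideline values is simple arithmetic, reproduced below using only numbers quoted in the text. The variable names are illustrative.

```python
# Guideline values as quoted in the text.
WHO_SULFATE_MG_L = 250.0
PH_SAFE_RANGE = (6.5, 8.5)

# Data-driven thresholds reported by the study.
thresholds = {"Sulfate": 185.1, "pH": 3.0, "Hardness": 96.4}

# Absolute and relative gap between the Sulfate threshold and the guideline.
sulfate_gap = round(WHO_SULFATE_MG_L - thresholds["Sulfate"], 1)
sulfate_gap_pct = round(100 * sulfate_gap / WHO_SULFATE_MG_L, 1)

# Distance of the pH threshold below the lower end of the safe range.
ph_gap = round(PH_SAFE_RANGE[0] - thresholds["pH"], 1)
ph_in_range = PH_SAFE_RANGE[0] <= thresholds["pH"] <= PH_SAFE_RANGE[1]

print(sulfate_gap, sulfate_gap_pct, ph_gap, ph_in_range)
```

The Sulfate threshold is therefore roughly a quarter below the quoted guideline, and the pH threshold is 3.5 units below the lower bound of the safe range.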


4. Discussion

Our findings show that interpretable machine learning can uncover practical water quality thresholds. The Sulfate threshold suggests that safety issues may start below current guidelines, possibly because of interactions with other factors. pH’s importance reflects its central role in water chemistry, where it affects properties such as the toxicity of other constituents. Hardness’s role, which WHO standards do not cover, may point to links with contamination or treatment issues.

This study stands out by using SHAP to find thresholds in complex, non-linear data, unlike simpler statistical methods. Water facilities could use these thresholds for focused monitoring, and regulators might use them to refine guidelines. However, the dataset’s size and regional focus limit how far the results generalize. The 62.0% accuracy shows how difficult potability is to predict from these nine features, but useful insights still emerged. Future work should test these thresholds on data from other regions, add time-series data, and include more contaminants.


5. Conclusion

This study shows how transparent machine learning can move beyond predicting water safety to identifying specific risk levels. By finding data-driven thresholds for key factors, we offer practical guidance for water monitoring and potential updates to regulations. This approach is a blueprint for clear AI solutions in environmental health. Future efforts should validate these findings across regions, add more factors, and build real-time monitoring systems.

