Water resources are a cornerstone of human livelihoods and economic progress, and they are closely linked to public health and environmental well-being. Accurate prediction of water quality is pivotal to improving water resource management and combating pollution. Using several performance metrics, this study assesses five models (linear regression, Random Forest, XGBoost, LightGBM, and an MLP neural network) for forecasting pH values in Georgia, USA. LightGBM attains the highest average accuracy among the models examined. More broadly, the tree-based models show a clear advantage on this regression task, whereas the performance of the MLP neural network is sensitive to feature scaling. We also analyze why the machine learning models achieve higher accuracy than the original study, which accounts for temporal dependencies and spatial factors. The primary objective of this work is to establish a robust predictive pipeline tailored for practical applications. It caters not only to readers well versed in data science but also to those without specialized knowledge of a particular application domain. In essence, we offer a fresh perspective on achieving relative precision in data science methodologies, emphasizing both prediction accuracy and interpretability.