Prediction is central to water resources research. Historically, physically based models have been preferred; however, they have largely failed to model hydrological processes at the catchment scale, and some important prediction problems cannot be modeled physically at all. Machine learning (ML) models have therefore come to be seen as a valid alternative in recent years. Despite their availability, well-optimized state-of-the-art ML strategies are not widely used in water resources research, because applying such models and optimizing their hyperparameters requires expert mathematical and statistical knowledge. Further, some analyses require training many models, so that even expert statisticians sometimes cannot optimize hyperparameters properly. To use data effectively to drive scientific advances in the field, it is essential to make ML models accessible to subject matter experts by improving automated machine learning resources. ML models such as XGBoost have recently been shown to outperform random forest (RF) models, which are traditionally used in water resources research. In this study, we extensively compare XGBoost and RF on more than 150 water-related datasets. The study gives water scientists access to quick, user-friendly optimization of RF and XGBoost models.
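To make the kind of comparison described in the abstract concrete, the following is a minimal sketch of tuning an RF model and a boosted-tree model with the same automated search and scoring both on held-out data. It is not the procedure used in the study: the synthetic dataset, the search spaces, the number of search iterations, and the helper name `tune_and_score` are all illustrative assumptions. If the `xgboost` package is not installed, the sketch falls back to scikit-learn's `GradientBoostingRegressor` as a stand-in, which is a different implementation of gradient boosting.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

try:
    # Preferred: the XGBoost scikit-learn wrapper.
    from xgboost import XGBRegressor as BoostedRegressor
except ImportError:
    # Stand-in (assumption): scikit-learn's gradient boosting, not XGBoost itself.
    BoostedRegressor = GradientBoostingRegressor


def tune_and_score(model, search_space, X_train, y_train, X_test, y_test,
                   n_iter=5, seed=0):
    """Hypothetical helper: random-search the hyperparameters with 3-fold
    cross-validation, then report the best setting and the held-out R^2."""
    search = RandomizedSearchCV(
        model, search_space, n_iter=n_iter, cv=3, scoring="r2",
        random_state=seed,
    )
    search.fit(X_train, y_train)
    return search.best_params_, float(search.score(X_test, y_test))


# Synthetic regression data standing in for a water-related dataset (assumption).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Illustrative search spaces; the parameter names below exist in both
# XGBRegressor and GradientBoostingRegressor.
candidates = {
    "random_forest": (
        RandomForestRegressor(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [4, 8, None],
         "max_features": [0.5, 1.0]},
    ),
    "boosted_trees": (
        BoostedRegressor(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [2, 4],
         "learning_rate": [0.05, 0.1, 0.3], "subsample": [0.7, 1.0]},
    ),
}

results = {
    name: tune_and_score(model, space, X_train, y_train, X_test, y_test)
    for name, (model, space) in candidates.items()
}

for name, (best_params, r2) in results.items():
    print(f"{name}: held-out R^2 = {r2:.3f}, best params = {best_params}")
```

Using the same search budget and the same train/test split for both models keeps the comparison like for like; a study over many datasets would repeat this loop per dataset and aggregate the scores.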