Comparison of machine learning algorithms for merging gridded satellite and earth-observed precipitation data

Gridded satellite precipitation datasets are useful in hydrological applications because they cover large regions with high spatial density. However, they are inaccurate in the sense that they do not agree with ground-based measurements. An established way to improve their accuracy is to correct them with machine learning algorithms. The correction takes the form of a regression problem in which the ground-based measurements are the dependent variable and the satellite data, together with topography factors (e.g., elevation), are the predictor variables. Most studies of this kind involve a limited number of machine learning algorithms and are conducted for a small region and a limited time period. Their results are therefore of local importance and do not provide more general guidance or best practices. To provide generalizable results and to contribute to the delivery of best practices, we here compare eight state-of-the-art machine learning algorithms in correcting satellite precipitation data for the entire contiguous United States and for a 15-year period. We use monthly data from the PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) gridded dataset, together with monthly earth-observed precipitation data from the Global Historical Climatology Network monthly database, version 2 (GHCNm). The results suggest that extreme gradient boosting (XGBoost) and random forests are the most accurate in terms of the squared error scoring function. The remaining algorithms can be ordered from best to worst as follows: Bayesian regularized feed-forward neural networks, multivariate adaptive polynomial splines (poly-MARS), gradient boosting machines (gbm), multivariate adaptive regression splines (MARS), feed-forward neural networks, and linear regression.
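The regression formulation described in the abstract can be illustrated with a minimal sketch. It uses linear regression, the simplest of the eight compared algorithms, and scores it with mean squared error against the uncorrected satellite values. Everything here is an assumption made for illustration: the data are synthetic (not PERSIANN or GHCNm), the coefficients in `make_synthetic_data` are invented, and the function names are hypothetical rather than taken from the paper.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    design = [[1.0] + list(r) for r in rows]  # prepend intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(beta, rows):
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in rows]


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def make_synthetic_data(n, seed=0):
    """Synthetic stand-in data; the relationship below is invented."""
    rng = random.Random(seed)
    rows, gauge = [], []
    for _ in range(n):
        satellite = rng.uniform(0.0, 200.0)  # satellite estimate, mm/month
        elevation = rng.uniform(0.0, 3.0)    # station elevation, km
        # Hypothetical "ground truth": a biased function of the predictors plus noise.
        truth = 10.0 + 0.7 * satellite + 8.0 * elevation + rng.gauss(0.0, 5.0)
        rows.append((satellite, elevation))
        gauge.append(truth)
    return rows, gauge


rows, gauge = make_synthetic_data(600)
train_rows, test_rows = rows[:400], rows[400:]
train_y, test_y = gauge[:400], gauge[400:]

# Dependent variable: ground-based measurement.
# Predictors: satellite value and elevation.
beta = fit_linear_regression(train_rows, train_y)

mse_corrected = mean_squared_error(test_y, predict(beta, test_rows))
mse_raw = mean_squared_error(test_y, [r[0] for r in test_rows])
print(f"raw satellite MSE: {mse_raw:.2f}")
print(f"corrected MSE:     {mse_corrected:.2f}")
```

A comparison like the one in the paper would repeat the fit-and-score step for each algorithm on the same held-out data and rank the algorithms by their squared error.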

