Gridded satellite precipitation datasets are useful in hydrological applications because they cover large regions with high spatial density. However, they are inaccurate, in the sense that they do not agree with ground-based measurements. An established means of improving their accuracy is to correct them using machine learning algorithms. This correction takes the form of a regression problem, in which the ground-based measurements play the role of the dependent variable and the satellite data, together with topographic factors (e.g., elevation), are the predictor variables. Most studies of this kind involve a limited number of machine learning algorithms and are conducted for a small region and a limited time period. Their results are therefore of local importance and do not provide more general guidance or best practices. To provide generalizable results and to contribute to the delivery of best practices, we here compare eight state-of-the-art machine learning algorithms in correcting satellite precipitation data for the entire contiguous United States and for a 15-year period. We use monthly data from the PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) gridded dataset, together with monthly ground-based precipitation data from the Global Historical Climatology Network monthly database, version 2 (GHCNm). The results suggest that extreme gradient boosting (XGBoost) and random forests are the most accurate in terms of the squared error scoring function. The remaining algorithms can be ordered from best to worst as follows: Bayesian regularized feed-forward neural networks, multivariate adaptive polynomial splines (poly-MARS), gradient boosting machines (gbm), multivariate adaptive regression splines (MARS), feed-forward neural networks, and linear regression.
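The regression framing described in the abstract can be illustrated with a minimal sketch. It uses only linear regression, the simplest of the eight algorithms compared, and scores it with the mean squared error. Everything in it is an assumption made for illustration: the data are synthetic (not PERSIANN or GHCNm), the error model for the satellite estimate is invented, and the function names are hypothetical. It is not the authors' implementation.

```python
import numpy as np


def make_synthetic_data(n, seed=0):
    """Hypothetical synthetic stand-in for satellite/gauge pairs (not real data)."""
    rng = np.random.default_rng(seed)
    elevation = rng.uniform(0.0, 3000.0, size=n)       # station elevation, m
    gauge = rng.gamma(shape=2.0, scale=40.0, size=n)   # monthly precipitation, mm
    # Assumed error model: multiplicative bias, elevation-dependent offset, noise.
    satellite = 0.6 * gauge + 0.01 * elevation + rng.normal(0.0, 10.0, size=n)
    return satellite, elevation, gauge


def design_matrix(satellite, elevation):
    """Predictors: intercept, satellite estimate, elevation."""
    return np.column_stack([np.ones_like(satellite), satellite, elevation])


def fit_linear_correction(satellite, elevation, gauge):
    """Least-squares fit with the gauge measurement as the dependent variable."""
    coef, *_ = np.linalg.lstsq(design_matrix(satellite, elevation), gauge, rcond=None)
    return coef


def apply_correction(coef, satellite, elevation):
    """Corrected precipitation predicted from the satellite value and elevation."""
    return design_matrix(satellite, elevation) @ coef


def mean_squared_error(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


satellite, elevation, gauge = make_synthetic_data(2000)
train, test = slice(0, 1500), slice(1500, 2000)   # simple hold-out split

coef = fit_linear_correction(satellite[train], elevation[train], gauge[train])
corrected = apply_correction(coef, satellite[test], elevation[test])

mse_raw = mean_squared_error(gauge[test], satellite[test])
mse_corrected = mean_squared_error(gauge[test], corrected)
print(f"MSE, uncorrected satellite: {mse_raw:.1f}")
print(f"MSE, linear correction:     {mse_corrected:.1f}")
```

Under these assumptions, any of the other algorithms named in the abstract could be swapped in for the least-squares fit while keeping the same predictors, dependent variable, and squared error score. The sketch says nothing about how they would rank on real data.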