To facilitate effective decision-making, gridded satellite precipitation products should include uncertainty estimates. Machine learning has been proposed for issuing such estimates. However, most existing algorithms for this purpose rely on quantile regression. Distributional regression offers distinct advantages over quantile regression, including the ability to model intermittency and a stronger ability to extrapolate beyond the training data, which is critical for predicting extreme precipitation. In this work, we introduce the concept of distributional regression to the engineering task of creating precipitation datasets through data merging. Building upon this concept, we propose new ensemble learning methods that can be valuable not only for spatial prediction but also for prediction problems in general. These methods exploit conditional zero-adjusted probability distributions estimated with three base learners: generalized additive models for location, scale, and shape (GAMLSS), spline-based GAMLSS, and distributional regression forests. They also exploit ensembles of these learners, formed either by stacking based on quantile regression or by equal-weight averaging. To identify the most effective methods for our specific problem, we compared them to benchmarks using a large, multi-source precipitation dataset. Stacking emerged as the most successful strategy. Three specific stacking methods achieved the best performance according to the quantile scoring rule, although the ranking of these methods varied across quantile levels. This suggests that a task-specific combination of multiple algorithms could yield significant benefits.
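To make three ideas named in the abstract concrete (a zero-adjusted distribution, the quantile scoring rule, and equal-weight averaging), here is a minimal sketch. It assumes a zero-adjusted exponential distribution, chosen only because its quantile function has a closed form. All function names and parameter values are hypothetical, and the code is not the GAMLSS or distributional regression forest models evaluated in this work.

```python
import math


def zero_adjusted_exponential_quantile(tau, p_zero, scale):
    """Quantile at level tau of a mixture with point mass p_zero at zero
    (dry case) and an Exponential(scale) distribution otherwise (wet case).

    Assumption: the exponential is an illustrative stand-in for the
    positive part; it is not the distribution used in the paper.
    """
    if not 0.0 <= tau < 1.0:
        raise ValueError("tau must lie in [0, 1)")
    if not 0.0 <= p_zero <= 1.0:
        raise ValueError("p_zero must lie in [0, 1]")
    if tau <= p_zero:
        # The requested level falls inside the probability mass at zero.
        return 0.0
    # Rescale tau to a level of the positive part, then invert its CDF.
    tau_wet = (tau - p_zero) / (1.0 - p_zero)
    return -scale * math.log(1.0 - tau_wet)


def quantile_score(y, q, tau):
    """Quantile (pinball) score of predicted quantile q at level tau
    for observation y. Lower values are better."""
    indicator = 1.0 if y < q else 0.0
    return (y - q) * (tau - indicator)


def equal_weight_average(quantiles):
    """Combine the quantile predictions of several models with equal weights."""
    return sum(quantiles) / len(quantiles)


if __name__ == "__main__":
    # Hypothetical predictions of two models for the same grid cell.
    q_a = zero_adjusted_exponential_quantile(0.9, p_zero=0.6, scale=2.0)
    q_b = zero_adjusted_exponential_quantile(0.9, p_zero=0.5, scale=3.0)
    q_avg = equal_weight_average([q_a, q_b])
    observed = 4.0  # hypothetical observation in mm
    for name, q in [("model A", q_a), ("model B", q_b), ("average", q_avg)]:
        print(name, round(q, 3), round(quantile_score(observed, q, 0.9), 3))
```

The point mass at zero is what lets a distributional model represent intermittency: for any quantile level below the predicted dry probability, the predicted quantile is exactly zero. Stacking based on quantile regression would replace the fixed equal weights with weights learned separately for each quantile level.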