Fair scores reward ensemble forecast members that behave like samples from the same distribution as the verifying observations. They are therefore an attractive choice as loss functions for training data-driven ensemble forecasts or post-processing methods when large training ensembles are either unavailable or computationally prohibitive. The adjusted continuous ranked probability score (aCRPS) is fair and unbiased with respect to ensemble size, provided forecast members are exchangeable and interpretable as conditionally independent draws from an underlying predictive distribution. However, distribution-aware post-processing methods that introduce structural dependency between members can violate this assumption, rendering aCRPS unfair. We demonstrate this effect using two approaches designed to minimize the expected aCRPS of a finite ensemble: (1) a linear member-by-member calibration, which couples members through a common dependency on the sample ensemble mean, and (2) a deep-learning method, which couples members via transformer self-attention across the ensemble dimension. In both cases, the results are sensitive to ensemble size, and apparent gains in aCRPS can correspond to systematic unreliability characterized by over-dispersion. We introduce trajectory transformers as a proof-of-concept that ensemble-size independence can be achieved. This approach is an adaptation of the Post-processing Ensembles with Transformers (PoET) framework and applies self-attention over lead time while preserving the conditional independence required by aCRPS. When applied to weekly mean $T_{2m}$ forecasts from the ECMWF subseasonal forecasting system, this approach successfully reduces systematic model biases whilst also improving or maintaining forecast reliability regardless of the ensemble size used in training (3 vs 9 members) or in real-time forecasts (9 vs 100 members).
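As a concrete illustration of the score discussed above, the following is a minimal sketch of the fair (adjusted) CRPS estimator for a finite ensemble. The function name `fair_crps` is our own for illustration; the estimator replaces the $M^2$ divisor of the pairwise-spread term in the standard ensemble CRPS with $M(M-1)$, which is what makes it unbiased with respect to ensemble size when members are exchangeable, conditionally independent draws.

```python
import numpy as np

def fair_crps(members, obs):
    """Fair (adjusted) CRPS of a finite ensemble against a scalar observation.

    fair_crps = (1/M) * sum_i |x_i - y|
                - 1/(2*M*(M-1)) * sum_{i,j} |x_i - x_j|

    Unbiased with respect to ensemble size M only if the members are
    exchangeable, conditionally independent draws from the predictive
    distribution -- the assumption that member-coupling post-processing
    methods can violate. Illustrative sketch, not the paper's code.
    """
    x = np.asarray(members, dtype=float)
    m = x.size
    if m < 2:
        raise ValueError("fair CRPS needs at least 2 ensemble members")
    # mean absolute error of members against the verifying observation
    term1 = np.mean(np.abs(x - obs))
    # pairwise ensemble spread, divided by M*(M-1) instead of M**2
    # (the "fair" adjustment that removes the finite-ensemble bias)
    term2 = np.abs(x[:, None] - x[None, :]).sum() / (2.0 * m * (m - 1))
    return term1 - term2
```

For example, a two-member ensemble `[0.0, 1.0]` verifying at `0.5` scores exactly zero: the spread term fully offsets the member error, which the standard $M^2$ divisor would not achieve.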