Consider two forecasters, each making one prediction per event for a sequence of events over time. We ask a relatively basic question: how might we compare these forecasters, either online or post-hoc, while avoiding unverifiable assumptions on how the forecasts and outcomes were generated? In this paper, we present a rigorous answer to this question by designing novel sequential inference procedures for estimating the time-varying difference in forecast scores. To do this, we employ confidence sequences (CS), which are sequences of confidence intervals that can be continuously monitored and are valid at arbitrary data-dependent stopping times ("anytime-valid"). The widths of our CSs adapt to the underlying variance of the score differences. Underlying their construction is a game-theoretic statistical framework, in which we further identify e-processes and p-processes for sequentially testing a weak null hypothesis: whether one forecaster outperforms another on average (rather than always). Our methods do not make distributional assumptions on the forecasts or outcomes; our main theorems apply to any bounded scores, and we later provide alternative methods for unbounded scores. We empirically validate our approaches by comparing real-world baseball and weather forecasters.
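To make the notion of a confidence sequence concrete, the following is a minimal sketch and not the variance-adaptive construction developed in the paper. It swaps in a simpler Hoeffding-style CS with a fixed tuning parameter, assuming every score lies in [0, 1] so that each score difference lies in [-1, 1]. The function name `hoeffding_cs` and the parameters `alpha` and `lam` are illustrative choices, not names from the paper. The intervals cover the running average of the conditional mean score differences simultaneously over all times with probability at least 1 - alpha, without assuming the differences are i.i.d.

```python
import math
import random


def hoeffding_cs(score_diffs, alpha=0.05, lam=0.1):
    """Yield (t, lower, upper) for the running average score difference.

    Assumption: each difference d_i lies in [-1, 1]. By Hoeffding's lemma,
    exp(lam * sum(d_i - delta_i) - t * lam**2 / 2) is a nonnegative
    supermartingale (delta_i is the conditional mean of d_i given the past),
    so Ville's inequality plus a union bound over the two tails gives a
    radius of log(2 / alpha) / (lam * t) + lam / 2 valid at all t at once.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if lam <= 0.0:
        raise ValueError("lam must be positive")
    log_term = math.log(2.0 / alpha)
    total = 0.0
    for t, d in enumerate(score_diffs, start=1):
        if not -1.0 <= d <= 1.0:
            raise ValueError("score differences must lie in [-1, 1]")
        total += d
        mean = total / t
        radius = log_term / (lam * t) + lam / 2.0
        # The target average also lies in [-1, 1], so clipping keeps validity.
        yield t, max(-1.0, mean - radius), min(1.0, mean + radius)


if __name__ == "__main__":
    # Hypothetical simulation: outcomes are Bernoulli(0.7); forecaster A
    # always predicts 0.7 and forecaster B always predicts 0.5.
    rng = random.Random(0)
    diffs = []
    for _ in range(5000):
        y = 1.0 if rng.random() < 0.7 else 0.0
        brier_a = (0.7 - y) ** 2
        brier_b = (0.5 - y) ** 2
        diffs.append(brier_b - brier_a)  # positive values favor forecaster A
    for t, lo, hi in hoeffding_cs(diffs):
        if t % 1000 == 0:
            print(t, round(lo, 3), round(hi, 3))
```

Because `lam` is fixed, the radius of this sketch levels off near `lam / 2` instead of shrinking to zero, and it ignores the observed variance of the differences. Both limitations are what a variance-adaptive CS is meant to address.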