Consider two forecasters, each making one prediction per event for a sequence of events over time. We ask a relatively basic question: how might we compare these forecasters, either online or post-hoc, while avoiding unverifiable assumptions on how the forecasts and outcomes were generated? In this paper, we present a rigorous answer to this question by designing novel sequential inference procedures for estimating the time-varying difference in forecast scores. To do this, we employ confidence sequences (CSs), which are sequences of confidence intervals that can be continuously monitored and are valid at arbitrary data-dependent stopping times ("anytime-valid"). The widths of our CSs adapt to the underlying variance of the score differences. Underlying their construction is a game-theoretic statistical framework, in which we further identify e-processes and p-processes for sequentially testing a weak null hypothesis: whether one forecaster outperforms another on average (rather than always). Our methods do not make distributional assumptions on the forecasts or outcomes; our main theorems apply to any bounded scores, and we later provide alternative methods for unbounded scores. We empirically validate our approaches by comparing real-world baseball and weather forecasters.
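To make the idea of an anytime-valid confidence sequence concrete, here is a minimal Python sketch. It is not the variance-adaptive construction described in the abstract: it swaps in a simpler and more conservative Hoeffding-style bound, combined with a union bound over time that spends error budget alpha / (t(t+1)) at step t. The Brier score, the sign convention (positive when forecaster A is better), the simulated forecasters, and all function names are assumptions made for illustration only.

```python
import math
import random


def brier(p, y):
    """Brier score of probability forecast p for binary outcome y (lower is better)."""
    return (p - y) ** 2


def hoeffding_cs(diffs, alpha=0.05):
    """Return one (lower, upper) interval per time step.

    Assumes every score difference lies in [-1, 1]. The intervals target the
    running average of the conditional means of the differences, and they hold
    simultaneously over all times with probability at least 1 - alpha
    (Azuma-Hoeffding at each t, plus a union bound over t).
    """
    intervals = []
    total = 0.0
    for t, d in enumerate(diffs, start=1):
        total += d
        alpha_t = alpha / (t * (t + 1))  # these budgets sum to alpha over t = 1, 2, ...
        radius = math.sqrt(2.0 * math.log(2.0 / alpha_t) / t)
        mean = total / t
        # Clip to [-1, 1], the range the differences are assumed to lie in.
        intervals.append((max(-1.0, mean - radius), min(1.0, mean + radius)))
    return intervals


def simulate_diffs(n, seed=0):
    """Hypothetical data: outcomes are Bernoulli(0.9); forecaster A always
    predicts 0.9 and forecaster B always predicts 0.5."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n):
        y = 1 if rng.random() < 0.9 else 0
        # Positive difference means A had the lower (better) Brier score.
        diffs.append(brier(0.5, y) - brier(0.9, y))
    return diffs


diffs = simulate_diffs(5000)
cs = hoeffding_cs(diffs)
print("interval after 100 steps: ", cs[99])
print("interval after 5000 steps:", cs[-1])
```

Because the intervals are valid simultaneously over time, one may stop monitoring as soon as the lower bound exceeds zero and still retain the stated coverage. A variance-adaptive bound of the kind the abstract describes would typically give narrower intervals than this sketch when the score differences have low variance.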