Skill scores, which measure the relative improvement of a forecasting method over a benchmark via consistent scoring functions and proper scoring rules, are a standard tool in forecast evaluation, yet their sampling uncertainty is rarely quantified rigorously. Modern forecasting applications are increasingly multivariate and involve evaluations across multiple horizons, variables, spatial locations, and forecasting methods. In such settings, standard tools like the pairwise Diebold-Mariano test of forecast accuracy or pointwise confidence intervals do not account for the multiple comparison problem, which leads to inflated Type I error rates and invalid joint inference. To address the lack of a coherent, statistically rigorous framework for quantifying uncertainty across these multi-dimensional evaluation problems, we introduce simultaneous confidence bands for expected scores and skill scores. Our framework provides a versatile tool for joint inference that is applicable to any forecast type, from mean and quantile forecasts to full distributional forecasts. We develop a bootstrap implementation and show that our bands are valid under multivariate extensions of the classical Diebold-Mariano assumptions. We demonstrate the practical utility of the approach in two case studies: quantifying the benefits of time-varying parameter models for macroeconomic forecasting, and comparing data-driven and physics-based models in probabilistic weather forecasting.
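To make the idea of a simultaneous band concrete, the following is a minimal sketch and not the construction developed in the paper. It assumes negatively oriented scores with a positive benchmark mean, defines the skill score per evaluation dimension (for example, per horizon) as one minus the ratio of mean scores, and builds a generic sup-t style band from a moving-block bootstrap. All function names, the block length, and the number of bootstrap replications are hypothetical choices made for illustration.

```python
import numpy as np


def skill_scores(scores_model, scores_bench):
    """Skill score per column: 1 - mean(model score) / mean(benchmark score).

    Assumes negatively oriented scores (smaller is better) and a
    positive benchmark mean in every column.
    """
    return 1.0 - scores_model.mean(axis=0) / scores_bench.mean(axis=0)


def block_bootstrap_indices(n, block_len, rng):
    """Moving-block bootstrap: concatenate randomly chosen contiguous blocks."""
    n_blocks = int(np.ceil(n / block_len))
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    idx = (starts[:, None] + np.arange(block_len)[None, :]).ravel()
    return idx[:n]


def simultaneous_band(scores_model, scores_bench, level=0.9,
                      n_boot=999, block_len=5, seed=0):
    """Generic sup-t style band (illustrative assumption, not the paper's method).

    Rows are time points, columns are evaluation dimensions
    (horizons, variables, locations, ...).
    """
    rng = np.random.default_rng(seed)
    n, k = scores_model.shape
    est = skill_scores(scores_model, scores_bench)

    # Resample the same time indices for both methods to keep their dependence.
    boot = np.empty((n_boot, k))
    for b in range(n_boot):
        idx = block_bootstrap_indices(n, block_len, rng)
        boot[b] = skill_scores(scores_model[idx], scores_bench[idx])

    # Bootstrap standard error per column.
    se = boot.std(axis=0, ddof=1)

    # Critical value: quantile of the maximum studentized deviation across
    # columns, so the band is calibrated jointly rather than column by column.
    max_dev = np.max(np.abs(boot - est) / se, axis=1)
    crit = np.quantile(max_dev, level)

    return est, est - crit * se, est + crit * se, crit
```

Because the critical value is taken from the distribution of the maximum deviation over all columns, the resulting band is never narrower than pointwise intervals built from the same bootstrap draws at the same level. That extra width is the price of joint coverage across horizons, variables, or locations.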