Multivariate probabilistic time series forecasts are commonly evaluated via proper scoring rules, i.e., functions that are minimal in expectation for the ground-truth distribution. However, this property is not sufficient to guarantee good discrimination in the non-asymptotic regime. In this paper, we provide the first systematic finite-sample study of proper scoring rules for time-series forecasting evaluation. Through a power analysis, we identify the "region of reliability" of a scoring rule, i.e., the set of practical conditions where it can be relied on to identify forecasting errors. We carry out our analysis on a comprehensive synthetic benchmark, specifically designed to test several key discrepancies between ground-truth and forecast distributions, and we gauge the generalizability of our findings to real-world tasks with an application to an electricity production problem. Our results reveal critical shortcomings in the evaluation of multivariate probabilistic forecasts as commonly performed in the literature.
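To make the notion of a finite-sample power analysis concrete, the following is a minimal sketch and not the procedure used in the paper. It assumes a toy setting chosen purely for illustration: a standard Gaussian ground truth, a forecast whose mean is shifted by a hypothetical `shift` parameter, the sample-based energy score as the proper scoring rule, and a one-sided paired test with a normal-approximation critical value of 1.645. The function names `energy_score` and `estimate_power` are likewise illustrative.

```python
import numpy as np


def energy_score(samples, y):
    """Sample-based energy score of a forecast (lower is better).

    samples: array of shape (m, d) drawn from the forecast distribution.
    y: array of shape (d,), the realized observation.
    """
    m = samples.shape[0]
    # E||X - y|| estimated over the forecast samples.
    term1 = np.linalg.norm(samples - y, axis=1).mean()
    # E||X - X'|| estimated over all pairs with i != j (diagonal is zero).
    pair = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=2)
    term2 = pair.sum() / (m * (m - 1))
    return term1 - 0.5 * term2


def estimate_power(shift, n_obs=30, n_samples=50, n_trials=100, dim=2, seed=0):
    """Fraction of trials in which the scoring rule flags the shifted forecast.

    Assumed toy setup: ground truth N(0, I), wrong forecast N(shift, I).
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        diffs = np.empty(n_obs)
        for t in range(n_obs):
            y = rng.standard_normal(dim)                          # observation
            good = rng.standard_normal((n_samples, dim))          # true forecast
            bad = rng.standard_normal((n_samples, dim)) + shift   # wrong forecast
            diffs[t] = energy_score(bad, y) - energy_score(good, y)
        # One-sided paired test: is the wrong forecast scored worse on average?
        t_stat = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(n_obs))
        rejections += t_stat > 1.645  # approximate 5% critical value
    return rejections / n_trials


if __name__ == "__main__":
    for shift in (0.0, 0.25, 0.5, 2.0):
        print(f"shift={shift}: estimated power={estimate_power(shift):.2f}")
```

Sweeping `shift`, `n_obs`, `n_samples`, and `dim` in such a sketch shows where the estimated power stays high, which is the kind of map a "region of reliability" is meant to summarize; the paper's actual benchmark, scoring rules, and test may differ.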