Multivariate probabilistic time series forecasts are commonly evaluated via proper scoring rules, i.e., functions that are minimal in expectation for the ground-truth distribution. However, this property is not sufficient to guarantee good discrimination in the non-asymptotic regime. In this paper, we provide the first systematic finite-sample study of proper scoring rules for time-series forecasting evaluation. Through a power analysis, we identify the "region of reliability" of a scoring rule, i.e., the set of practical conditions where it can be relied on to identify forecasting errors. We carry out our analysis on a comprehensive synthetic benchmark, specifically designed to test several key discrepancies between ground-truth and forecast distributions, and we gauge the generalizability of our findings to real-world tasks with an application to an electricity production problem. Our results reveal critical shortcomings in the evaluation of multivariate probabilistic forecasts as commonly performed in the literature.
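To make the idea of a finite-sample power analysis concrete, the following is a minimal sketch and not the procedure used in the paper. It assumes a simplified setting: the ground truth is a zero-mean equicorrelated Gaussian, the erroneous forecast gets the correlation wrong, and the scoring rule is a sample-based estimate of the energy score. Instead of a formal hypothesis test, it reports how often the scoring rule ranks the ground-truth forecast ahead of the erroneous one. All names and parameters (`energy_score`, `discrimination_rate`, `rho_true`, `rho_wrong`, `m`, `n_trials`) are hypothetical choices made for illustration.

```python
import numpy as np


def energy_score(samples, y):
    """Sample-based energy score estimate (lower is better).

    samples: array of shape (m, d), draws from the forecast distribution.
    y: array of shape (d,), the realized observation.
    """
    # E||X - y||, estimated over the forecast samples.
    term1 = np.linalg.norm(samples - y, axis=1).mean()
    # E||X - X'||, estimated over all sample pairs (including i == j,
    # which makes the estimate slightly biased for small m).
    diffs = samples[:, None, :] - samples[None, :, :]
    term2 = np.linalg.norm(diffs, axis=2).mean()
    return term1 - 0.5 * term2


def discrimination_rate(n_obs, d=4, rho_true=0.5, rho_wrong=0.0,
                        m=64, n_trials=200, seed=0):
    """Fraction of trials in which the mean energy score over n_obs
    observations is lower for the ground-truth forecast than for a
    forecast with the wrong correlation (an illustrative proxy for power).
    """
    rng = np.random.default_rng(seed)

    def equicorrelated_cov(rho):
        # Unit variances, constant off-diagonal correlation rho.
        return (1.0 - rho) * np.eye(d) + rho * np.ones((d, d))

    cov_true = equicorrelated_cov(rho_true)
    cov_wrong = equicorrelated_cov(rho_wrong)
    mean = np.zeros(d)

    wins = 0
    for _ in range(n_trials):
        # Observations always come from the ground-truth distribution.
        obs = rng.multivariate_normal(mean, cov_true, size=n_obs)
        # Each forecast is represented by m samples, as is typical for
        # sample-based probabilistic forecasters.
        f_true = rng.multivariate_normal(mean, cov_true, size=m)
        f_wrong = rng.multivariate_normal(mean, cov_wrong, size=m)
        s_true = np.mean([energy_score(f_true, y) for y in obs])
        s_wrong = np.mean([energy_score(f_wrong, y) for y in obs])
        wins += int(s_true < s_wrong)
    return wins / n_trials
```

Sweeping `n_obs`, `d`, or the gap between `rho_true` and `rho_wrong` and recording where the returned rate is close to 1 gives a rough picture of the conditions under which the scoring rule can be relied on in this toy setting.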