We consider the problem of evaluating forecasts of binary events when the predictions are consumed by rational agents who take an action in response to each prediction, but whose utility functions are unknown to the forecaster. We show that optimizing forecasts for a single scoring rule (e.g., the Brier score) cannot guarantee low regret for all possible agents. In contrast, well-calibrated forecasts guarantee that all agents incur sublinear regret. However, calibration is not a necessary criterion here (miscalibrated forecasts can still provide good regret guarantees for all possible agents), and calibrated forecasting procedures have provably worse convergence rates than forecasting procedures targeting a single scoring rule. Motivated by this, we present a new metric for evaluating forecasts that we call U-calibration, equal to the maximal regret of the sequence of forecasts when evaluated under any bounded scoring rule. We show that sublinear U-calibration error is a necessary and sufficient condition for all agents to achieve sublinear regret guarantees. We additionally demonstrate how to compute the U-calibration error efficiently and provide an online algorithm that achieves $O(\sqrt{T})$ U-calibration error (on par with optimal rates for optimizing for a single scoring rule, and bypassing lower bounds for traditional calibrated learning procedures). Finally, we discuss generalizations to the multiclass prediction setting.
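To make the notion of regret used above concrete, the following Python sketch computes the regret of a forecast sequence in two ways: under the Brier score, and for one hypothetical agent who acts whenever the forecast reaches a threshold. The function names, the threshold agent's payoff (gain `1 - c` if the event occurs, lose `c` otherwise), and the comparison against the best fixed forecast or action in hindsight are illustrative assumptions, not the paper's definitions or its algorithm for computing U-calibration error.

```python
from typing import Sequence


def _check(forecasts: Sequence[float], outcomes: Sequence[int]) -> None:
    # Both sequences must be non-empty and aligned round by round.
    if len(forecasts) != len(outcomes) or len(outcomes) == 0:
        raise ValueError("forecasts and outcomes must be non-empty and equal length")


def brier_regret(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Brier loss incurred minus the loss of the best fixed forecast in hindsight."""
    _check(forecasts, outcomes)
    incurred = sum((p - x) ** 2 for p, x in zip(forecasts, outcomes))
    # For squared loss, the best constant forecast is the empirical base rate.
    base_rate = sum(outcomes) / len(outcomes)
    best_fixed = sum((base_rate - x) ** 2 for x in outcomes)
    return incurred - best_fixed


def threshold_agent_regret(
    forecasts: Sequence[float], outcomes: Sequence[int], c: float
) -> float:
    """Regret of a hypothetical agent who acts iff the forecast is at least c.

    Assumed payoff: acting yields (x - c), not acting yields 0.
    The benchmark is the better of "always act" and "never act" in hindsight.
    """
    _check(forecasts, outcomes)
    earned = sum((x - c) for p, x in zip(forecasts, outcomes) if p >= c)
    always_act = sum(x - c for x in outcomes)
    best_fixed = max(always_act, 0.0)
    return best_fixed - earned


if __name__ == "__main__":
    forecasts = [0.0, 0.0, 0.0, 0.0]
    outcomes = [1, 1, 1, 1]
    print(brier_regret(forecasts, outcomes))
    print(threshold_agent_regret(forecasts, outcomes, c=0.5))
```

Taking the maximum of `threshold_agent_regret` over a grid of thresholds gives a rough sense of the worst case over this one family of agents; it is not claimed to equal the U-calibration error defined in the paper.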