Safety-critical prediction systems, such as autonomous vehicles, weather forecasters, and medical monitors, commonly rely on probabilistic forecasters. These forecasters make predictions about possible future outcomes, and their quality and robustness need to be validated and certified. Often, only accuracy (the mean of the predictions) is evaluated against true outcomes. However, for safety-critical scenarios and decision making under uncertainty, the full distributional properties of the forecasts should be checked: do the observed prediction errors actually follow the forecasted probability distributions? To this end, we introduce a framework for calibration checks: statistical tests that validate distributional properties of forecasts when measured over many samples. To support ease of use in real-world operations, these checks produce a single accept/reject decision for data collected from a forecaster. This contrasts with typical calibration calculations, which produce one or more continuous calibration scores and require expertise to implement in a validation workflow. We further support operationalization by introducing modifications to calibration testing that (a) reject only overconfident predictions, allowing for pessimistic or cautious predictions in safety-critical settings, and (b) tolerate small, operationally acceptable deviations even for large numbers of validation samples. We organize the calibration checking process into a modular pipeline comprising four steps: (i) the data model, (ii) the chosen metric, (iii) the hypothesis formulation, and (iv) the testing procedure. Each step consists of independently swappable components, thereby supporting a large variety of possible use cases and trade-offs. We demonstrate the applicability of the framework on two complementary example problems: weather forecasting and robot pose estimation.
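As a rough illustration of the four-step pipeline described above, the following minimal sketch shows one way such a check could be assembled. It is not the authors' implementation. It assumes independent samples with Gaussian forecasts (data model), uses coverage of a central prediction interval as the metric, tests the one-sided null hypothesis that coverage is at least the nominal level minus a tolerance (so only overconfidence is rejected), and applies an exact binomial test. The names `calibration_check`, `binom_cdf`, `level`, `tolerance`, and `alpha` are hypothetical and chosen only for this example.

```python
import math
from statistics import NormalDist


def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def calibration_check(y, mu, sigma, level=0.9, tolerance=0.02, alpha=0.05):
    """Return "accept" or "reject" for a set of Gaussian forecasts.

    y, mu, sigma: observed outcomes, forecast means, forecast standard deviations.
    All parameter names and defaults are illustrative assumptions.
    """
    n = len(y)

    # (i) Data model: each outcome is assumed to follow N(mu_i, sigma_i^2),
    # with samples treated as independent.
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)

    # (ii) Metric: number of outcomes inside the central `level` interval.
    hits = sum(abs(yi - mi) <= z_crit * si for yi, mi, si in zip(y, mu, sigma))

    # (iii) Hypothesis: H0 says the true coverage is at least level - tolerance.
    # Coverage above nominal (cautious forecasts) never counts against H0.
    p0 = level - tolerance

    # (iv) Test: exact one-sided binomial test; a small p-value means too few
    # outcomes fell inside the intervals, i.e. the forecasts are overconfident.
    p_value = binom_cdf(hits, n, p0)
    return "reject" if p_value < alpha else "accept"


# Usage: 100 forecasts of N(0, 1) where 30 outcomes fall far outside the interval.
outcomes = [0.0] * 70 + [10.0] * 30
print(calibration_check(outcomes, [0.0] * 100, [1.0] * 100))
```

Because each step is a separate block in the function, a different data model, metric, hypothesis, or test could in principle be swapped in without touching the others, which mirrors the modularity the abstract describes.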