In safety-critical applications, data-driven models must not only be accurate but also provide reliable uncertainty estimates. This property, commonly referred to as calibration, is essential for risk-aware decision-making. In regression, a wide variety of calibration metrics and recalibration methods have emerged. However, these metrics differ significantly in their definitions, assumptions, and scales, making it difficult to interpret and compare results across studies. Moreover, most recalibration methods have been evaluated on only a small subset of metrics, leaving it unclear whether improvements generalize across different notions of calibration. In this work, we systematically extract and categorize regression calibration metrics from the literature and benchmark them independently of specific modelling methods or recalibration approaches. Through controlled experiments with real-world, synthetic, and artificially miscalibrated data, we demonstrate that calibration metrics frequently produce conflicting results: many metrics disagree in their evaluation of the same recalibration result, and some even support contradictory conclusions. This inconsistency is particularly concerning because it allows metrics to be cherry-picked to create a misleading impression of success. We identify the Expected Normalized Calibration Error (ENCE) and the Coverage Width-based Criterion (CWC) as the most dependable metrics in our tests. Our findings highlight the critical role of metric selection in calibration research.