We study the fundamental question of how to define and measure the distance from calibration for probabilistic predictors. While the notion of perfect calibration is well understood, there is no consensus on how to quantify the distance from perfect calibration. Numerous calibration measures have been proposed in the literature, but it is unclear how they compare to each other, and many popular measures, such as Expected Calibration Error (ECE), fail to satisfy basic properties like continuity. We present a rigorous framework for analyzing calibration measures, inspired by the literature on property testing. We propose a ground-truth notion of distance from calibration: the $\ell_1$ distance to the nearest perfectly calibrated predictor. We define a consistent calibration measure as one that is polynomially related to this distance. Applying our framework, we identify three calibration measures that are consistent and can be estimated efficiently: smooth calibration, interval calibration, and Laplace kernel calibration. The first two give quadratic approximations to the ground-truth distance, and we show that this is information-theoretically optimal in a natural model for measuring calibration, which we term the prediction-only access model. Our work thus establishes fundamental lower and upper bounds on measuring the distance to calibration, and also provides theoretical justification for preferring certain metrics (like Laplace kernel calibration) in practice.
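The discontinuity of ECE mentioned above can be illustrated with a small numerical sketch. It is not taken from the paper: the function names `ece` and `laplace_kernel_ce` are hypothetical, the two-point sample is chosen only for illustration, and the kernel estimate is a plain plug-in average with the Laplace kernel $\exp(-|u - v|)$, whose normalization is an assumption rather than the authors' exact definition.

```python
import math
from collections import defaultdict


def ece(preds, labels):
    """Unbinned ECE on a finite sample: average of |E[y | f] - f|.

    Points are grouped by their exact predicted value (no binning).
    """
    groups = defaultdict(list)
    for p, y in zip(preds, labels):
        groups[p].append(y)
    n = len(preds)
    return sum(
        len(ys) / n * abs(sum(ys) / len(ys) - p) for p, ys in groups.items()
    )


def laplace_kernel_ce(preds, labels):
    """Plug-in kernel calibration error with kernel exp(-|u - v|).

    Assumed form: sqrt of the mean over all pairs (i, j) of
    (y_i - f_i) * (y_j - f_j) * K(f_i, f_j).
    """
    n = len(preds)
    total = 0.0
    for pi, yi in zip(preds, labels):
        for pj, yj in zip(preds, labels):
            total += (yi - pi) * (yj - pj) * math.exp(-abs(pi - pj))
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(total, 0.0)) / n


labels = [0, 1]
calibrated = [0.5, 0.5]      # predicts 1/2 on both points: perfectly calibrated
eps = 0.01
perturbed = [0.5 - eps, 0.5 + eps]  # moved by only eps in each coordinate

print("ECE:      ", ece(calibrated, labels), "->", ece(perturbed, labels))
print("Kernel CE:", laplace_kernel_ce(calibrated, labels), "->",
      laplace_kernel_ce(perturbed, labels))
```

On this sample, an arbitrarily small perturbation of the predictions moves ECE from 0 to roughly 0.5, while the Laplace kernel estimate changes only slightly, which is the kind of behavior that motivates preferring continuous measures.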