Calibration measures and reliability diagrams are two fundamental tools for measuring and interpreting the calibration of probabilistic predictors. Calibration measures quantify the degree of miscalibration, and reliability diagrams visualize the structure of this miscalibration. However, the most common constructions of reliability diagrams and calibration measures (binning and ECE) both suffer from well-known flaws (e.g. discontinuity). We show that a simple modification fixes both constructions: first smooth the observations using an RBF kernel, then compute the Expected Calibration Error (ECE) of this smoothed function. We prove that with a careful choice of bandwidth, this method yields a calibration measure that is well-behaved in the sense of (Błasiok, Gopalan, Hu, and Nakkiran 2023a): a consistent calibration measure. We call this measure the SmoothECE. Moreover, the reliability diagram obtained from this smoothed function visually encodes the SmoothECE, just as binned reliability diagrams encode the BinnedECE. We also provide a Python package with simple, hyperparameter-free methods for measuring and plotting calibration: `pip install relplot`.
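To make the two-step recipe above concrete (smooth with an RBF kernel, then take the ECE of the smoothed function), here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the `relplot` implementation: the function name `smooth_ece_sketch`, the fixed bandwidth `sigma`, the evaluation grid, and the plain (non-boundary-corrected) Gaussian kernel are all hypothetical choices, and the careful bandwidth selection that the abstract refers to is not reproduced here.

```python
import numpy as np


def smooth_ece_sketch(f, y, sigma=0.05, grid_size=401):
    """Kernel-smoothed calibration error for binary outcomes (illustrative only).

    f: predicted probabilities in [0, 1]; y: observed outcomes in {0, 1}.
    sigma and grid_size are assumptions of this sketch, not values from the paper.
    """
    f = np.asarray(f, dtype=float)
    y = np.asarray(y, dtype=float)
    t = np.linspace(0.0, 1.0, grid_size)

    # Gaussian (RBF) kernel weights between grid points and predictions,
    # shape (grid_size, n).
    z = (t[:, None] - f[None, :]) / sigma
    w = np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))

    # Step 1: smooth the residuals y - f. Averaging K(t, f_i) * (y_i - f_i)
    # over i equals (smoothed residual at t) * (kernel density estimate at t).
    smoothed_gap = (w * (y - f)[None, :]).mean(axis=1)

    # Step 2: integrate the absolute smoothed gap over [0, 1]
    # using the trapezoid rule on the uniform grid.
    abs_gap = np.abs(smoothed_gap)
    dt = t[1] - t[0]
    return float(np.sum(0.5 * (abs_gap[1:] + abs_gap[:-1])) * dt)
```

As a sanity check on the sketch, a predictor that always outputs 0.5 while every outcome is 1 should score close to 0.5, and predictions that equal the outcomes exactly should score 0. Because the kernel is not corrected at the boundaries of [0, 1], predictions very close to 0 or 1 lose some kernel mass outside the interval, so the sketch can understate miscalibration there.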