Every uncalibrated classifier has a corresponding true calibration map that calibrates its confidence. Deviations of this idealistic map from the identity map reveal miscalibration. Such calibration errors can be reduced by many post-hoc calibration methods, which fit some family of calibration maps on a validation dataset. In contrast, evaluating calibration with the expected calibration error (ECE) on the test set does not explicitly involve fitting. However, as we demonstrate, ECE can still be viewed as implicitly fitting a family of functions on the test data. This motivates the fit-on-the-test view of evaluation: first, approximate a calibration map on the test data, and second, quantify its distance from the identity. Exploiting this view allows us to unlock missed opportunities: (1) use the plethora of post-hoc calibration methods for evaluating calibration; (2) tune the number of bins in ECE with cross-validation. Furthermore, we introduce: (3) benchmarking on pseudo-real data where the true calibration map can be estimated very precisely; and (4) novel calibration and evaluation methods using new calibration map families PL and PL3.
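The fit-on-the-test view of binned ECE can be made concrete in a few lines: the per-bin average confidences and accuracies define a histogram calibration map fitted on the test data, and ECE is the bin-mass-weighted distance of that map from the identity. The following is a minimal illustrative sketch (equal-width binning and all names are our assumptions, not the paper's implementation):

```python
import numpy as np

def ece_as_fitted_map(confidences, correct, n_bins=15):
    """Equal-width binned ECE, read as (1) fitting a histogram
    calibration map on the test data and (2) measuring its
    mass-weighted distance from the identity map.
    Illustrative sketch only; binning scheme is an assumption."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # "Fit" step: assign each prediction to a bin of the histogram map.
    idx = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            conf_b = confidences[mask].mean()  # map input: avg confidence in bin
            acc_b = correct[mask].mean()       # map output: empirical accuracy in bin
            # "Distance" step: deviation from identity, weighted by bin mass.
            ece += mask.mean() * abs(acc_b - conf_b)
    return ece
```

A perfectly calibrated bin (average confidence equal to empirical accuracy) contributes zero, so the score is zero exactly when the fitted histogram map coincides with the identity on every occupied bin.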