Calibration, the problem of ensuring that predicted probabilities align with observed class frequencies, is a basic desideratum for reliable prediction with machine learning systems. Calibration error is traditionally assessed via a divergence function, as the expected divergence between predictions and the corresponding empirical class frequencies. Accurately estimating this quantity is challenging, especially in the multiclass setting. Here, we show how to extend a recent variational framework for estimating calibration errors beyond divergences induced by proper losses, to cover a broad class of calibration errors induced by $L_p$ divergences. Our method can separate over- and under-confidence and, unlike non-variational approaches, avoids overestimation. We provide extensive experiments and integrate our code into the open-source package probmetrics (https://github.com/dholzmueller/probmetrics) for evaluating calibration errors.
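For context, a minimal sketch of the quantity being estimated, using one common definition from the calibration literature (the abstract itself does not fix the exact form, and variants without the outer root or restricted to the top-label prediction also appear): the $L_p$ calibration error of a probabilistic classifier $f$ is the expected $L_p$ deviation between the predicted distribution and the true class distribution conditional on that prediction,
$$
\mathrm{CE}_p(f) \;=\; \Big( \mathbb{E}_{X}\big[\, \big\| \,\mathbb{E}[Y \mid f(X)] - f(X)\, \big\|_p^p \,\big] \Big)^{1/p},
$$
where $Y$ denotes the one-hot label vector and $f(X)$ the predicted probability vector; $f$ is perfectly calibrated iff $\mathrm{CE}_p(f) = 0$. Estimating the inner conditional expectation from finite samples is what makes the multiclass setting difficult.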