The Expected Calibration Error (ECE), the dominant calibration metric in machine learning, compares predicted probabilities against empirical frequencies of binary outcomes. This is appropriate when labels are binary events. However, many modern settings produce labels that are themselves probabilities rather than binary outcomes: a radiologist's stated confidence, a teacher model's soft output in knowledge distillation, a class posterior derived from a generative model, or an annotator agreement fraction. In these settings, ECE commits a category error: it discards the probabilistic information in the label by forcing it into a binary comparison. The result is not a noisy approximation that more data will correct. It is a structural misalignment that persists, converging with increasing precision to the wrong answer as sample size grows. We introduce the Soft Mean Expected Calibration Error (SMECE), a calibration metric for settings where labels are themselves probabilities. The modification to the ECE formula is one line: replace the empirical hard-label fraction in each prediction bin with the mean probability label of the samples in that bin. SMECE reduces exactly to ECE when labels are binary, making it a strict generalisation.
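The one-line modification described above can be sketched as follows. This is a minimal illustration, not the paper's reference implementation: the function name, equal-width binning scheme, and confidence-weighted aggregation are standard ECE conventions assumed here. The only departure from ECE is that each bin's accuracy term is the mean of the probabilistic labels rather than the fraction of positive hard labels.

```python
import numpy as np

def smece(probs, labels, n_bins=10):
    """Sketch of the soft-label calibration error described above.

    Bins predictions by confidence and compares each bin's mean predicted
    probability to the mean of its (probabilistic) labels. When labels are
    binary {0, 1}, the mean label is the empirical positive fraction, so
    this reduces exactly to standard binned ECE.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin; clip so that probs == 1.0
    # falls into the last bin rather than overflowing.
    idx = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue
        # ECE would use the fraction of hard positives here; SMECE takes
        # the mean probability label instead -- the one-line change.
        gap = abs(probs[mask].mean() - labels[mask].mean())
        err += mask.mean() * gap  # weight bins by their sample share
    return err
```

With binary labels the function behaves as ordinary binned ECE; with soft labels it credits a model whose predictions match the label distribution inside each bin, which the hard-label comparison cannot do.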