Calibrated predictions are useful because their numerical values can be interpreted as probabilities. Calibration errors are therefore widely used to evaluate, compare, and tune probabilistic predictors. Recently, Haghtalab et al. (2024) introduced an additional requirement for such measures: truthfulness. A calibration measure is truthful if a predictor minimizes its expected measured error by reporting the true conditional label distribution. Many standard empirical calibration errors are non-truthful: a predictor may appear better calibrated by distorting its probabilities than by reporting them truthfully. We study the practical role of truthfulness for calibration measurement in multiclass prediction. First, we introduce perfectly truthful calibration errors for multidimensional linear properties of the label distribution, generalizing the truthful calibration error for binary predictions in Hartline et al. (2025). This framework includes full multiclass calibration and classwise calibration. We also identify a truthful correction for confidence calibration. Second, we characterize the decision-theoretic implications of these truthful errors. For calibrated predictors, truthful calibration errors preserve Blackwell dominance: a more informative calibrated predictor receives no larger expected error. Third, we show that this decision-theoretic interpretation explains and mitigates the widely observed ranking robustness problem of binned calibration errors. Empirically, non-truthful confidence-based errors can reverse model rankings when the number of bins changes, while our truthful errors give more stable rankings across binning choices.
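To make the binning issue concrete, the following is a minimal sketch of the standard equal-width binned confidence calibration error, evaluated at several bin counts. It is not the truthful error proposed in this work. The function name `binned_confidence_ece`, the synthetic Dirichlet data, and the chosen bin counts are all assumptions made for illustration.

```python
import numpy as np


def binned_confidence_ece(probs, labels, n_bins):
    """Equal-width binned confidence calibration error (illustrative sketch).

    probs:  (n, k) array of predicted class probabilities.
    labels: (n,) array of integer class labels.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidence = probs.max(axis=1)                        # top-class probability
    correct = (probs.argmax(axis=1) == labels).astype(float)
    # Map each confidence to one of n_bins equal-width bins on [0, 1];
    # a confidence of exactly 1.0 falls into the last bin.
    bin_ids = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap                      # weight by bin frequency
    return ece


# Synthetic setup (an assumption for illustration): the predictor reports
# the true conditional label distribution, and labels are drawn from it.
rng = np.random.default_rng(0)
true_p = rng.dirichlet(np.ones(3), size=2000)
labels = np.array([rng.choice(3, p=p) for p in true_p])

# The same truthful predictions are scored under different bin counts.
results = {n_bins: binned_confidence_ece(true_p, labels, n_bins)
           for n_bins in (5, 10, 20, 50)}
for n_bins, value in results.items():
    print(f"bins={n_bins:3d}  ECE={value:.4f}")
```

On a finite sample the reported value typically changes with the bin count even though the predictions themselves are unchanged, which is one way a comparison between two models can become sensitive to the binning choice.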