Confidence calibration of classification models estimates the true posterior probability of the predicted class, which is critical for reliable decision-making in practical applications. Existing confidence calibration methods mostly use statistical techniques to estimate the calibration curve from data or to fit a user-defined calibration function, but they often fail to exploit the prior distribution behind the calibration curve. A well-informed prior distribution can provide valuable information beyond the empirical data, particularly when data are limited or in low-density regions of the confidence scores. To fill this gap, this paper proposes a new method that integrates the prior distribution behind the calibration curve with empirical data to estimate a continuous calibration curve. The method models the sampling process of calibration data as a binomial process and maximizes the likelihood function of that process. We prove that the calibration curve estimation method is Lipschitz continuous with respect to the data distribution and requires only $3/B$ of the sample size needed by histogram binning, where $B$ is the number of bins. We also design a new calibration metric, $TCE_{bpm}$, which uses the estimated calibration curve to estimate the true calibration error (TCE), and we prove that $TCE_{bpm}$ is a consistent calibration measure. Furthermore, binomial process modeling can generate realistic calibration datasets from a preset true calibration curve and confidence score distribution. These datasets can serve as a benchmark for measuring and comparing the discrepancy between existing calibration metrics and the true calibration error. The effectiveness of our calibration method and metric is verified on real-world and simulated data.
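The benchmark idea in the abstract, generating calibration data from a preset true calibration curve and a confidence score distribution, can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the curve `true_curve(s) = s ** 1.5`, the Beta(5, 2) score distribution, the helper names, and the choice of 15 bins. Each label is drawn as a single-trial binomial (Bernoulli) sample with success probability given by the curve. Because the curve is known, the true calibration error can be computed directly and compared with a standard equal-width binned ECE. The sketch does not implement the paper's curve estimator or $TCE_{bpm}$.

```python
import numpy as np

def true_curve(s):
    """Hypothetical preset calibration curve: P(correct | confidence = s)."""
    return s ** 1.5  # overconfident model: accuracy lies below confidence

def simulate_calibration_data(n, rng):
    # Assumed confidence-score distribution, skewed toward high confidence.
    scores = rng.beta(5.0, 2.0, size=n)
    # One single-trial binomial (Bernoulli) draw per sample.
    correct = rng.binomial(1, true_curve(scores))
    return scores, correct

def true_calibration_error(scores):
    # Monte Carlo estimate of E|P(correct | s) - s| using the known curve.
    return float(np.mean(np.abs(true_curve(scores) - scores)))

def binned_ece(scores, correct, n_bins=15):
    # Standard equal-width histogram-binning ECE, shown for comparison only.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(scores, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - scores[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

rng = np.random.default_rng(0)
scores, correct = simulate_calibration_data(20_000, rng)
tce = true_calibration_error(scores)
ece = binned_ece(scores, correct)
print(f"TCE = {tce:.4f}, binned ECE = {ece:.4f}")
```

Rerunning the sketch with a smaller sample size or a different number of bins shows how far a binned metric can drift from the known true error, which is the kind of comparison such a simulated benchmark is meant to support.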