Informed decision-making requires classifiers that give reliable confidence scores. To this end, recent work has focused on miscalibration, i.e., the over- or under-confidence of model scores. Yet calibration is not enough: even a perfectly calibrated classifier with the best possible accuracy can have confidence scores that are far from the true posterior probabilities. This is due to the grouping loss, created by samples with the same confidence scores but different true posterior probabilities. Proper scoring rule theory shows that, given the calibration loss, the missing piece to characterize individual errors is the grouping loss. While there are many estimators of the calibration loss, none exists for the grouping loss in standard settings. Here, we propose an estimator to approximate the grouping loss. We show that modern neural network architectures in vision and NLP exhibit grouping loss, notably in distribution-shift settings, which highlights the importance of pre-production validation.
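To make the claim that calibration is not enough concrete, the sketch below works through a toy binary population under the Brier score, where the expected loss splits into a calibration loss, a grouping loss, and an irreducible loss. It is an illustration only, not the estimator proposed here: it assumes the true posteriors are known, which never holds in practice, and the function name `brier_decomposition` and the toy numbers are hypothetical.

```python
import numpy as np


def brier_decomposition(scores, posteriors, weights):
    """Population-level Brier decomposition for a binary problem.

    Assumes the true posteriors P(Y=1 | X) are known, which is only
    possible in a synthetic setting like this one (hypothetical helper).
    """
    scores = np.asarray(scores, dtype=float)
    posteriors = np.asarray(posteriors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()

    # Calibrated score C = E[Q | S]: average posterior among samples
    # that share the same confidence score.
    calibrated = np.empty_like(scores)
    for s in np.unique(scores):
        mask = scores == s
        calibrated[mask] = np.average(posteriors[mask], weights=weights[mask])

    calibration_loss = np.sum(weights * (scores - calibrated) ** 2)
    grouping_loss = np.sum(weights * (calibrated - posteriors) ** 2)
    irreducible_loss = np.sum(weights * posteriors * (1.0 - posteriors))
    return calibration_loss, grouping_loss, irreducible_loss


# Toy population: two equally likely subgroups with true posteriors
# 0.9 and 0.6, both assigned the same confidence score 0.75.
scores = [0.75, 0.75]
posteriors = [0.9, 0.6]
weights = [0.5, 0.5]

cl, gl, il = brier_decomposition(scores, posteriors, weights)
print(f"calibration loss: {cl:.4f}")
print(f"grouping loss:    {gl:.4f}")
print(f"irreducible loss: {il:.4f}")
```

In this toy case the classifier is perfectly calibrated (the average posterior among samples scored 0.75 is 0.75) and always predicts the Bayes-optimal class, since both posteriors exceed 0.5. The calibration loss is therefore zero, yet the grouping loss is about 0.0225 because each individual score is off by 0.15 from its true posterior.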