Despite their impressive generalization capabilities, deep neural networks have repeatedly been shown to be overconfident when they are wrong. Correcting this overconfidence is known as model calibration, and it has received much attention in the form of modified training schemes and post-training calibration procedures such as temperature scaling. While temperature scaling is frequently used because of its simplicity, it is often outperformed by modified training schemes. In this work, we identify a specific bottleneck for the performance of temperature scaling. We show that, for empirical risk minimizers on a general set of distributions in which the supports of the classes overlap, the performance of temperature scaling degrades with the amount of overlap between classes and asymptotically becomes no better than random when the number of classes is large. On the other hand, we prove that optimizing a modified form of the empirical risk induced by the Mixup data augmentation technique can lead to reasonably good calibration performance, showing that training-time calibration may be necessary in some situations. We also verify that our theoretical results reflect practice by showing that Mixup significantly outperforms empirical risk minimization (with respect to multiple calibration metrics) on image classification benchmarks with class overlaps introduced in the form of label noise.
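For readers unfamiliar with the two techniques compared above, the following is a minimal sketch under their standard textbook definitions, not the exact procedures analyzed in this work. Temperature scaling is taken to mean dividing a trained model's logits by one scalar T chosen to minimize negative log-likelihood on held-out data, and Mixup is taken to mean training on convex combinations of pairs of inputs and their one-hot labels. The function names, the grid-search range, and the Beta parameter `alpha` are illustrative assumptions; the modified empirical risk studied in the paper is not reproduced here.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Softmax of one logit vector divided by a temperature (stabilized)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a temperature."""
    losses = [
        -math.log(softmax(row, temperature)[y] + 1e-12)
        for row, y in zip(logit_rows, labels)
    ]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, grid=None):
    """Post-training calibration: pick the scalar T minimizing held-out NLL.

    A plain grid search is used for simplicity; gradient-based fitting is
    also common. The default grid is an assumption, not a recommendation.
    """
    if grid is None:
        # Roughly 0.05 to 50 in multiplicative steps of 2%.
        grid = [0.05 * (1.02 ** k) for k in range(350)]
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))


def mixup_pair(x_a, y_a, x_b, y_b, alpha, rng):
    """Training-time augmentation: mix two examples and their one-hot labels.

    The mixing weight is drawn from Beta(alpha, alpha), as in standard Mixup.
    """
    lam = rng.betavariate(alpha, alpha)
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam
```

Because dividing by a positive T rescales every logit equally, temperature scaling never changes which class is predicted; it only changes the reported confidence. Mixup instead alters the training targets themselves, which is the distinction the abstract draws between post-training and training-time calibration.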