Despite the impressive generalization capabilities of deep neural networks, they have repeatedly been shown to be overconfident when they are wrong. Correcting this overconfidence is known as model calibration, and it has received much attention in the form of modified training schemes and post-training calibration procedures such as temperature scaling. While temperature scaling is frequently used because of its simplicity, it is often outperformed by modified training schemes. In this work, we identify a specific bottleneck for the performance of temperature scaling. We show that, for empirical risk minimizers over a general family of distributions in which the supports of different classes overlap, the performance of temperature scaling degrades with the amount of overlap between classes and asymptotically becomes no better than random when the number of classes is large. On the other hand, we prove that optimizing a modified form of the empirical risk induced by the Mixup data augmentation technique can lead to reasonably good calibration performance, showing that training-time calibration may be necessary in some situations. We also verify that our theoretical results reflect practice by showing that Mixup significantly outperforms empirical risk minimization (with respect to multiple calibration metrics) on image classification benchmarks with class overlaps introduced in the form of label noise.
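For readers unfamiliar with the two techniques compared above, the following is a minimal NumPy sketch of each, not the authors' implementation. It assumes temperature scaling means dividing held-out logits by a single scalar T chosen to minimize negative log-likelihood (here by a simple grid search, an illustrative choice), and that Mixup forms convex combinations of input pairs and their one-hot labels with a Beta(alpha, alpha) mixing weight. All function names, the grid, and the default `alpha` are hypothetical.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a scalar temperature."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def nll(logits, labels, temperature=1.0):
    """Mean negative log-likelihood of integer labels under the scaled softmax."""
    probs = softmax(logits, temperature)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))


def fit_temperature(logits, labels, grid=None):
    """Post-training step: pick the scalar T > 0 with the lowest held-out NLL.

    A grid search is used for simplicity (an assumption of this sketch);
    a gradient-based optimizer would serve the same purpose.
    """
    if grid is None:
        grid = np.logspace(-1, 2, 200)  # candidate temperatures in [0.1, 100]
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])


def mixup_batch(x, y_onehot, alpha=1.0, rng=None):
    """Training-time step: mix each example with a randomly chosen partner.

    Returns convex combinations of inputs and of one-hot labels, using a
    single weight lam ~ Beta(alpha, alpha) for the whole batch.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix
```

Because dividing by a positive scalar leaves each row's argmax unchanged, temperature scaling can only adjust confidence, never the predicted class. Mixup, in contrast, changes the training objective itself.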