The Devil is in the Margin: Margin-based Label Smoothing for Network Calibration

Despite the dominant performance of deep neural networks, recent works have shown that they are poorly calibrated, resulting in over-confident predictions. Miscalibration can be exacerbated by overfitting caused by minimizing the cross-entropy during training, which pushes the predicted softmax probabilities to match the one-hot label assignments. This yields a pre-softmax activation (logit) of the correct class that is significantly larger than the remaining activations. Recent evidence from the literature suggests that loss functions that embed implicit or explicit maximization of the entropy of predictions yield state-of-the-art calibration performance. We provide a unifying constrained-optimization perspective of current state-of-the-art calibration losses. Specifically, these losses can be viewed as approximations of a linear penalty (or a Lagrangian) imposing equality constraints on logit distances. This points to an important limitation of such equality constraints: their gradients constantly push towards a non-informative solution, which might prevent the model from reaching the best compromise between discriminative performance and calibration during gradient-based optimization. Following these observations, we propose a simple and flexible generalization based on inequality constraints, which imposes a controllable margin on logit distances. Comprehensive experiments on a variety of image classification, semantic segmentation and NLP benchmarks demonstrate that our method sets new state-of-the-art results on these tasks in terms of network calibration, without affecting the discriminative performance. The code is available at https://github.com/by-liu/MbLS .
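To make the idea of a margin on logit distances concrete, here is a minimal sketch in plain Python. It assumes that the logit distance of class k is max_j(l_j) - l_k and that the inequality constraint is relaxed into a hinge-style penalty max(0, distance - margin) added to the cross-entropy. The function names, the exact form of the penalty, and the default values of `margin` and `weight` are illustrative assumptions, not taken from the abstract; the repository linked above holds the authors' actual implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def margin_penalty(logits, margin):
    """Sum of logit distances that exceed the margin (assumed hinge form).

    Distances at or below the margin contribute nothing, so the penalty's
    gradient vanishes once every logit is within `margin` of the largest one.
    With margin = 0 this reduces to a plain linear penalty on the distances.
    """
    top = max(logits)
    return sum(max(0.0, top - z - margin) for z in logits)


def margin_based_loss(logits, target, margin=10.0, weight=0.1):
    """Cross-entropy plus the weighted margin penalty (illustrative defaults)."""
    cross_entropy = -log_softmax(logits)[target]
    return cross_entropy + weight * margin_penalty(logits, margin)


if __name__ == "__main__":
    confident = [12.0, 1.0, 5.0]   # one distance (11) exceeds the margin of 10
    moderate = [3.0, 1.0, 2.0]     # all distances are within the margin
    print(margin_penalty(confident, 10.0))
    print(margin_penalty(moderate, 10.0))
    print(margin_based_loss(confident, target=0))
```

Under these assumptions, setting `margin` to zero penalizes every nonzero logit distance, which corresponds to the equality-constraint behavior the abstract criticizes, while a positive margin leaves moderately separated logits untouched.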
