Confidence-Aware Calibration and Scoring Functions for Curriculum Learning

Despite the great success of state-of-the-art deep neural networks, several studies have reported that these models are over-confident in their predictions, which indicates miscalibration. Label smoothing (LS) has been proposed as a solution to the over-confidence problem. It softens hard targets during training, typically by redistributing part of the probability mass of a `one-hot' label uniformly over all other labels. However, neither model confidence nor human confidence in a label is likely to be uniformly distributed in this manner, since some labels are more likely to be confused than others. In this paper we integrate model confidence and human confidence into label smoothing, yielding \textit{Model Confidence LS} and \textit{Human Confidence LS} respectively, to achieve better model calibration and generalization. To further enhance generalization, we show how our model and human confidence scores can be applied to curriculum learning, a training strategy inspired by learning tasks in `easier to harder' order. A higher model or human confidence score indicates a more recognisable, and therefore easier, sample, so the score can serve as a scoring function for ranking samples in curriculum learning. We evaluate our proposed methods with four state-of-the-art architectures on image and text classification tasks, using datasets with multi-rater label annotations by humans. We report that integrating model or human confidence information into label smoothing and curriculum learning improves both model performance and model calibration. The code is available at \url{https://github.com/AoShuang92/Confidence_Calibration_CL}.
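
As a rough illustration of the ideas summarised above, the following minimal NumPy sketch contrasts uniform label smoothing with soft targets derived from rater agreement, and then ranks samples from easiest to hardest. It is not taken from the linked repository. The function names, the smoothing factor `eps`, and the choice to define human confidence as the fraction of raters voting for each class are all assumptions made for this example; the abstract does not specify how the confidence scores are computed.

```python
import numpy as np

def uniform_label_smoothing(label, num_classes, eps=0.1):
    """Standard LS: keep 1 - eps on the true class and spread eps
    uniformly over the remaining classes (eps is an assumed value)."""
    target = np.full(num_classes, eps / (num_classes - 1))
    target[label] = 1.0 - eps
    return target

def human_confidence_targets(votes):
    """Hypothetical human-confidence targets: the fraction of raters
    who chose each class, computed per sample from raw vote counts."""
    votes = np.asarray(votes, dtype=float)
    return votes / votes.sum(axis=1, keepdims=True)

def curriculum_order(targets, labels):
    """Rank samples easiest first, scoring each sample by the
    confidence assigned to its gold label (higher = easier)."""
    labels = np.asarray(labels)
    scores = targets[np.arange(len(labels)), labels]
    return np.argsort(-scores, kind="stable")

# Toy data: 3 samples, 3 classes, 10 raters each (made-up counts).
votes = [[9, 1, 0],
         [4, 3, 3],
         [6, 4, 0]]
labels = [0, 0, 0]

soft_targets = human_confidence_targets(votes)
order = curriculum_order(soft_targets, labels)

print(uniform_label_smoothing(0, 3))  # same off-target mass for every class
print(soft_targets)                   # off-target mass follows rater confusion
print(order)                          # sample indices, easiest first
```

In this toy setting the uniform target gives every wrong class the same mass, whereas the vote-based target places more mass on the classes that raters actually confused. The same per-sample confidence then doubles as a curriculum scoring function; a model-confidence variant could substitute a trained model's softmax outputs for the vote fractions.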
