Pre-trained language models (PLMs) have become the de facto standard for improving accuracy on text classification tasks, but recent studies find that they often make over-confident predictions. Although various calibration methods have been proposed, such as ensemble learning and data augmentation, most have been validated on computer vision benchmarks rather than on PLM-based text classification tasks. In this paper, we present an empirical study of confidence calibration for PLMs that covers three categories of methods: confidence penalty losses, data augmentation, and ensemble methods. We find that an ensemble model overfitted to the training set shows sub-par calibration performance, and that PLMs trained with a confidence penalty loss exhibit a trade-off between calibration and accuracy. Building on these observations, we propose the Calibrated PLM (CALL), a combination of calibration techniques. CALL compensates for the drawbacks that can arise when a calibration method is used on its own, and it improves both classification accuracy and calibration performance. We extensively study the design choices in CALL's training procedure and provide a detailed analysis of how calibration techniques affect the calibration performance of PLMs.
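To make the terms concrete, the following is a minimal sketch rather than the method studied in the paper. The abstract does not state the exact loss or calibration metric, so the sketch assumes one common form of confidence penalty (cross-entropy minus a scaled entropy term) and one common calibration measure (expected calibration error, ECE, with equal-width bins). The function names, the `beta` weight, and the bin count are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_penalty_loss(logits, label, beta=0.1):
    """Assumed form: cross-entropy minus beta times the predictive entropy.

    Subtracting the entropy rewards less peaked (less confident) outputs.
    `beta` is an illustrative hyperparameter, not a value from the paper.
    """
    probs = softmax(logits)
    cross_entropy = -math.log(probs[label])
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return cross_entropy - beta * entropy


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins [k/n_bins, (k+1)/n_bins); 1.0 goes in the last bin.

    Each bin contributes |mean confidence - accuracy|, weighted by the
    fraction of samples that fall into it.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

Under these assumptions, a model that assigns confidence 0.9 to four predictions but gets only three of them right is over-confident: its mean confidence exceeds its accuracy, and ECE reports the size of that gap.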