Pre-trained language models (PLMs) may fail to give reliable estimates of their predictive uncertainty. We take a close look at this problem, aiming to answer two questions: (1) Do PLMs learn to become calibrated during training? (2) How effective are existing calibration methods? For the first question, we conduct fine-grained controlled experiments to study how PLMs' calibration performance changes during training. We consider six factors as control variables: dataset difficulty, the number of available training samples, the number of training steps, the number of tunable parameters, model scale, and pretraining. We observe a consistent change in calibration performance across all six factors. We find that PLMs do not learn to become calibrated during training, as evidenced by a continual increase in confidence regardless of whether the predictions are correct. We highlight that this finding somewhat contradicts two established conclusions: (a) larger PLMs are more calibrated; (b) pretraining improves model calibration. For the second question, we study the effectiveness of existing calibration methods in mitigating this overconfidence issue. Besides unlearnable calibration methods (e.g., label smoothing), we adapt and extend two recently proposed learnable methods that directly collect data to train models to produce reasonable confidence estimates. Experimental results show that learnable methods significantly reduce PLMs' confidence in wrong predictions. The code is available at \url{https://github.com/lifan-yuan/PLMCalibration}.
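To make the notion of calibration concrete, the following is a minimal sketch of two quantities commonly used to measure it: expected calibration error (ECE) and the average confidence on correct versus wrong predictions. The abstract does not state which metrics the experiments use, so the choice of ECE, the equal-width binning, and the function names `expected_calibration_error` and `confidence_by_outcome` are assumptions made for illustration, not the released implementation.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (illustrative; binning scheme is an assumption).

    confidences: max predicted probability per example, in (0, 1].
    correct:     1 if the prediction was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)

    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        # Half-open bins (lo, hi]; a confidence of exactly 0 falls in no bin.
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # Gap between accuracy and mean confidence inside this bin.
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            # Weight the gap by the fraction of examples in the bin.
            ece += in_bin.mean() * gap
    return ece


def confidence_by_outcome(confidences, correct):
    """Mean confidence on correct and on wrong predictions, as a pair."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    conf_right = confidences[correct].mean() if correct.any() else float("nan")
    conf_wrong = confidences[~correct].mean() if (~correct).any() else float("nan")
    return conf_right, conf_wrong
```

Under these assumptions, the overconfidence described above would show up as the second value returned by `confidence_by_outcome` rising over the course of training, while a calibration method that works as intended would lower it without lowering the first.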