Despite the significant progress made in practical applications of aligned language models (LMs), they tend to be more overconfident in their answers than the corresponding pre-trained LMs. In this work, we systematically evaluate the impact of the alignment process on the logit-based uncertainty calibration of LMs under the multiple-choice setting. We first conduct a careful empirical study of how aligned LMs differ in calibration from their pre-trained counterparts. Experimental results reveal that there are two distinct uncertainties in LMs under the multiple-choice setting, which are responsible for the answer decision and the format preference of the LMs, respectively. Then, we investigate the role of these two uncertainties in aligned LMs' calibration through fine-tuning in simple synthetic alignment schemes and conclude that one reason for aligned LMs' overconfidence is the conflation of these two types of uncertainty. Furthermore, we examine the utility of common post-hoc calibration methods for aligned LMs and propose an easy-to-implement and sample-efficient method to calibrate aligned LMs. We hope our findings provide insights into the design of more reliable alignment processes for LMs.
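To make "logit-based calibration under the multiple-choice setting" concrete, the following is a minimal sketch of how confidence and expected calibration error (ECE) are commonly computed from per-option logits. It is not the method proposed in this work: the function names, the equal-width binning, and the `temperature` parameter (standing in for temperature scaling, one common post-hoc method) are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn per-option logits into probabilities (temperature is an illustrative knob)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(option_logits, labels, n_bins=10, temperature=1.0):
    """Standard ECE with equal-width confidence bins.

    option_logits: one list of logits per question, one logit per answer option
                   (for example, the logits of the tokens "A", "B", "C", "D").
    labels:        index of the correct option for each question.
    """
    # Each bin stores [count, summed confidence, summed correctness].
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for logits, label in zip(option_logits, labels):
        probs = softmax(logits, temperature)
        confidence = max(probs)
        prediction = probs.index(confidence)
        idx = min(int(confidence * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += confidence
        bins[idx][2] += float(prediction == label)
    n = len(labels)
    # Sum over bins of (bin size / n) * |accuracy - mean confidence|.
    return sum(abs(conf_sum - acc_sum) / n for count, conf_sum, acc_sum in bins if count)


# Hypothetical logits: the model is very sure of option 0 but is right only half the time.
toy_logits = [[10.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]]
toy_labels = [0, 1]
ece_raw = expected_calibration_error(toy_logits, toy_labels)
ece_softened = expected_calibration_error(toy_logits, toy_labels, temperature=100.0)
```

In this toy case the overconfident logits give a large ECE, and dividing the logits by a temperature above 1 lowers the confidence and shrinks the gap to the observed accuracy. In practice the temperature would be fit on held-out data rather than chosen by hand.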