Accurate uncertainty quantification is crucial for the safe deployment of language models (LMs), and prior research has demonstrated improvements in the calibration of modern LMs. We study in-context learning (ICL), a prevalent method for adapting static LMs through tailored prompts, and examine the trade-off between performance and calibration across a broad spectrum of natural language understanding and reasoning tasks. Through comprehensive experiments, we observe that, as the number of ICL examples increases, models initially become more miscalibrated before their calibration improves, and that miscalibration tends to arise in low-shot settings. Moreover, we find that methods aimed at improving usability, such as fine-tuning and chain-of-thought (CoT) prompting, can lead to miscalibration and unreliable natural language explanations, suggesting that new methods may be required for scenarios where models are expected to be reliable.
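To make the notion of miscalibration concrete, the sketch below computes expected calibration error (ECE), a common summary of the gap between a model's stated confidence and its observed accuracy. The abstract does not name the metric used, so the choice of ECE is an assumption here, as are the function name `expected_calibration_error`, the equal-width binning scheme, and the default of 10 bins.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Hypothetical helper: ECE over equal-width confidence bins.

    confidences: the model's probability for its predicted answer, in [0, 1].
    correct: whether each prediction was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    conf_sum = [0.0] * n_bins  # sum of confidences per bin
    hits = [0] * n_bins        # number of correct predictions per bin
    count = [0] * n_bins       # number of predictions per bin

    for c, ok in zip(confidences, correct):
        # Bins are [k/n_bins, (k+1)/n_bins); a confidence of 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hits[idx] += int(ok)
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue
        avg_conf = conf_sum[b] / count[b]
        accuracy = hits[b] / count[b]
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (count[b] / n) * abs(accuracy - avg_conf)
    return ece


# Overconfident toy case: 90% confidence but only half the answers are right.
overconfident = expected_calibration_error([0.9] * 4, [True, False, True, False])

# Calibrated toy case: 75% confidence and three of four answers are right.
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
```

Under this measure, a model whose accuracy rises with more ICL examples can still score worse if its confidence rises faster than its accuracy, which is one way to read the performance and calibration trade-off described above.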