Modern auto-regressive language models are trained to minimize log loss on broad data by predicting the next token, so they are expected to produce calibrated answers when a task is framed as next-token prediction. We study this for in-context learning (ICL), a widely used way to adapt frozen large language models (LLMs) by crafting prompts, and investigate the trade-offs between performance and calibration on a wide range of natural language understanding and reasoning tasks. Through extensive experiments, we show that these trade-offs may worsen as we increase model size, incorporate more ICL examples, and fine-tune models using instructions, dialog, or reinforcement learning from human feedback (RLHF) on carefully curated datasets. Furthermore, we find that common recalibration techniques that are otherwise widely effective, such as temperature scaling, provide limited gains in calibration error, suggesting that new methods may be required for settings where models are expected to be reliable.
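To make the two quantities named in the abstract concrete, the following is a minimal sketch of how a calibration error can be measured and how temperature scaling can be fitted. It rests on several assumptions that do not come from the abstract: the calibration error is taken to be the expected calibration error (ECE) with equal-width confidence bins, the temperature is chosen by a simple grid search over held-out negative log-likelihood rather than by gradient descent, and all function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert one row of logits into probabilities, dividing by a temperature first."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width bins (an assumed metric, not one stated in the abstract).

    Each prediction is placed in a bin by its top-class confidence; the result is
    the sample-weighted average of |accuracy - mean confidence| over the bins.
    """
    # Per bin: [count, number correct, summed confidence]
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        conf = max(p)
        pred = p.index(conf)
        b = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[b][0] += 1
        bins[b][1] += float(pred == y)
        bins[b][2] += conf
    n = len(labels)
    # (count / n) * |acc - conf| simplifies to |correct_sum - conf_sum| / n
    return sum(abs(correct - conf_sum) / n for count, correct, conf_sum in bins if count)


def negative_log_likelihood(logits, labels, temperature):
    """Mean log loss of the temperature-scaled probabilities on labeled data."""
    total = 0.0
    for z, y in zip(logits, labels):
        total -= math.log(softmax(z, temperature)[y] + 1e-12)
    return total / len(labels)


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature from a grid that minimizes held-out log loss."""
    if grid is None:
        grid = [0.05 * k for k in range(1, 101)]  # 0.05, 0.10, ..., 5.00
    return min(grid, key=lambda t: negative_log_likelihood(logits, labels, t))
```

In use, `fit_temperature` would be run on a held-out set of answer-option logits and labels, and `expected_calibration_error` would then be compared before and after rescaling with the fitted temperature. Because dividing all logits by one positive temperature leaves the argmax unchanged, this kind of recalibration can change confidence but not accuracy.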