Modern auto-regressive language models are trained to minimize log loss on broad data by predicting the next token, so they are expected to produce calibrated answers when a problem is framed as a next-token prediction task. We study this expectation for in-context learning (ICL), a widely used way to adapt frozen large language models (LLMs) by crafting prompts, and investigate the trade-off between performance and calibration on a wide range of natural language understanding and reasoning tasks. Our extensive experiments show that this trade-off may worsen as we increase model size, incorporate more ICL examples, and fine-tune models using instruction tuning, dialog tuning, or reinforcement learning from human feedback (RLHF) on carefully curated datasets. Furthermore, we find that common recalibration techniques that are otherwise widely effective, such as temperature scaling, yield only limited reductions in calibration error, suggesting that new methods may be required in settings where models are expected to be reliable.
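To make the two quantities named above concrete, the following is a minimal sketch of how calibration error and temperature scaling are commonly computed. It is an illustration under stated assumptions, not the evaluation protocol of this work: the equal-width binning of the expected calibration error (ECE), the grid search over temperatures, and every function name (`softmax`, `expected_calibration_error`, `fit_temperature`, `evaluate`) are choices made here for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of a list of logits after dividing by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins (an assumed, common variant).

    Each prediction is binned by its confidence; the ECE is the
    size-weighted mean of |accuracy - mean confidence| over bins.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


def negative_log_likelihood(logits_list, labels, temperature):
    """Mean log loss of the temperature-scaled predictions."""
    total = 0.0
    for logits, label in zip(logits_list, labels):
        total -= math.log(softmax(logits, temperature)[label])
    return total / len(labels)


def fit_temperature(logits_list, labels, grid=None):
    """Choose the temperature that minimizes log loss on held-out data.

    A simple grid search is assumed here; gradient-based fitting
    is an equally valid choice.
    """
    if grid is None:
        grid = [0.05 * k for k in range(1, 201)]  # 0.05 .. 10.0
    return min(
        grid,
        key=lambda t: negative_log_likelihood(logits_list, labels, t),
    )


def evaluate(logits_list, labels, temperature=1.0):
    """Return (accuracy, ECE) of the arg-max predictions."""
    confidences, hits = [], []
    for logits, label in zip(logits_list, labels):
        probs = softmax(logits, temperature)
        pred = max(range(len(probs)), key=probs.__getitem__)
        confidences.append(probs[pred])
        hits.append(1 if pred == label else 0)
    accuracy = sum(hits) / len(hits)
    return accuracy, expected_calibration_error(confidences, hits)
```

Because dividing logits by a single positive temperature never changes the arg-max, temperature scaling leaves accuracy untouched and can only alter the confidence assigned to each prediction. In practice the temperature would be fitted on a held-out split and the ECE reported on a separate test split.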