Safe and trustworthy use of Large Language Models (LLMs) requires that they express accurate confidence in their answers. We introduce a novel Reinforcement Learning (RL) approach to LLM calibration that fine-tunes LLMs to elicit calibrated confidence estimates for their answers to factual questions. We model the problem as a betting game in which the model predicts a confidence score alongside every answer, and we design a reward function that penalizes both over- and under-confidence. We prove that under our reward design, an optimal policy yields perfectly calibrated confidence estimates. Our experiments demonstrate significantly improved confidence calibration and generalization to new tasks without re-training, indicating that our approach teaches a general confidence awareness. This approach enables the training of inherently calibrated LLMs.
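To make the calibration property concrete, the following is a minimal sketch of one reward with the behavior the abstract describes. The paper's actual betting-game reward is not specified here, so this example assumes a Brier-style (quadratic) reward, a standard proper scoring rule whose expected value is maximized exactly when the stated confidence equals the true probability of being correct; the function name `betting_reward` and the probability `p_true` are illustrative, not from the source.

```python
import numpy as np

# Hypothetical sketch (not the paper's reward): a Brier-style quadratic
# reward is one proper scoring rule with the calibration property the
# abstract describes.
def betting_reward(confidence: float, correct: bool) -> float:
    """Reward = 1 - (confidence - outcome)^2.

    Penalizes over-confidence (high confidence on a wrong answer) and
    under-confidence (low confidence on a correct answer) symmetrically.
    """
    outcome = 1.0 if correct else 0.0
    return 1.0 - (confidence - outcome) ** 2

# Sanity check: expected reward is maximized when the stated confidence
# equals the true probability of answering correctly (here p_true = 0.7),
# i.e., the optimal policy is perfectly calibrated.
p_true = 0.7
confidences = np.linspace(0.0, 1.0, 101)
expected = [
    p_true * betting_reward(c, True) + (1 - p_true) * betting_reward(c, False)
    for c in confidences
]
print(confidences[int(np.argmax(expected))])  # -> 0.7
```

Under this quadratic reward, the expected reward is 1 - [p(c - 1)^2 + (1 - p)c^2], whose unique maximizer is c = p, which is why any proper scoring rule of this form makes the calibrated report optimal.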