Safe and trustworthy use of Large Language Models (LLMs) requires accurate expression of confidence in their answers. We introduce a novel Reinforcement Learning (RL) approach for LLM calibration that fine-tunes LLMs to elicit calibrated confidence estimates in their answers to factual questions. We model the problem as a betting game in which the model predicts a confidence score together with every answer, and we design a reward function that penalizes both over- and under-confidence. We prove that under our reward design, an optimal policy yields perfectly calibrated confidence estimates. Our experiments demonstrate significantly improved confidence calibration and generalization to new tasks without re-training, indicating that our approach teaches a general confidence awareness. This approach enables the training of inherently calibrated LLMs.
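To make the calibration incentive concrete, below is a minimal sketch of one reward with the stated property, using a Brier-style proper scoring rule as a stand-in; the paper's actual reward design is not reproduced here and may differ.

```python
import numpy as np

def betting_reward(confidence: float, correct: bool) -> float:
    """Brier-style reward: expected reward is maximized when the
    stated confidence equals the true probability of being correct.

    NOTE: illustrative proper scoring rule, not necessarily the
    paper's exact reward function.
    """
    target = 1.0 if correct else 0.0
    # Penalizes over-confidence (high confidence, wrong answer) and
    # under-confidence (low confidence, right answer) symmetrically.
    return 1.0 - (confidence - target) ** 2

# Sanity check: if the model answers correctly with probability p,
# the expected reward  E[r] = 1 - p*(1-c)^2 - (1-p)*c^2
# is maximized at c = p, i.e. a calibrated confidence.
p = 0.7
cs = np.linspace(0.0, 1.0, 101)
expected = 1 - p * (1 - cs) ** 2 - (1 - p) * cs ** 2
print(cs[np.argmax(expected)])  # -> 0.7
```

This illustrates the abstract's central claim: under a proper scoring rule, the reward-maximizing confidence equals the true correctness probability, so an optimal policy is perfectly calibrated by construction.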