Large language models (LLMs) are increasingly deployed in decision-making tasks, where not only accuracy but also reliable confidence estimates are essential. Well-calibrated confidence enables downstream systems to decide when to trust a model and when to defer to fallback mechanisms. In this work, we conduct a systematic study of calibration in two widely used fine-tuning paradigms: supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). We show that while RLVR improves task performance, it produces extremely overconfident models, whereas SFT yields substantially better calibration, even under distribution shift, though with smaller performance gains. Through targeted experiments, we diagnose RLVR's failure mode: decision tokens merely extract the decision already reached in the reasoning trace and carry no confidence information, which prevents reinforcement learning from surfacing calibrated alternatives. Based on this insight, we propose a calibration-aware reinforcement learning formulation that directly adjusts decision-token probabilities. Our method preserves RLVR's accuracy while mitigating overconfidence, reducing expected calibration error (ECE) by up to 9 points.
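The calibration metric reported above, expected calibration error (ECE), can be sketched as follows. This is a minimal illustrative implementation of the standard binned ECE, not the paper's evaluation code; the function name and the equal-width 10-bin scheme are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: group predictions by confidence, then average the
    |accuracy - mean confidence| gap per bin, weighted by bin size.
    Illustrative sketch; bin scheme (equal-width) is an assumption."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

# An extremely overconfident model: always reports confidence 1.0
# but is right only half the time -> ECE of 0.5.
print(expected_calibration_error([1.0, 1.0, 1.0, 1.0], [1, 0, 1, 0]))  # 0.5
```

Under this metric, the overconfidence diagnosed in RLVR-trained models shows up as a large gap between near-1.0 decision-token confidence and actual accuracy.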