Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence--rationale alignment: whether a model's confidence in its committed answer is justified by its generated rationale. We introduce a GRPO-based reinforcement learning framework that jointly rewards answer correctness, committed-answer probability, and rubric-based rationale support. The rubric assesses grounding, coherence, task match, and connection to the selected answer, and the gold answer is never revealed to the judge. Across MedQA, MathQA, and OpenBookQA using three open-weight LLMs, our method reduces the confidence--rationale alignment error by up to 26.51% compared with untuned checkpoints, SFT, and correctness-only GRPO, while maintaining competitive accuracy and often improving calibration. These results show that reliable CoT reasoning requires not only confident answers but also rationales that substantively support them.
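To make the reward structure concrete, the following is a minimal sketch of how the three signals named above might be combined and turned into group-relative advantages. The abstract does not give the reward formula, so everything specific here is an assumption: the function names `combined_reward` and `group_relative_advantages`, the weights `w_correct` and `w_align`, the use of an absolute gap between committed-answer probability and rubric score, and the rubric score being normalized to [0, 1]. The advantage step follows the usual GRPO practice of standardizing rewards within a group of samples drawn for the same prompt.

```python
from statistics import mean, pstdev


def combined_reward(correct, answer_prob, rubric_score,
                    w_correct=1.0, w_align=1.0):
    """Hypothetical reward, not the paper's exact formula.

    correct:      whether the committed answer matches the gold answer
    answer_prob:  model probability of the committed answer, in [0, 1]
    rubric_score: judge's rationale-support score, assumed scaled to [0, 1]
    """
    # Assumption: alignment is rewarded when confidence tracks rationale support.
    alignment_gap = abs(answer_prob - rubric_score)
    return w_correct * float(correct) + w_align * (1.0 - alignment_gap)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled responses: (correct, answer_prob, rubric_score).
group = [
    (True, 0.90, 0.90),   # confident, correct, well supported
    (True, 0.95, 0.30),   # confident and correct, weak rationale
    (False, 0.90, 0.20),  # confident, wrong, weak rationale
    (False, 0.40, 0.40),  # unsure, wrong, modest rationale
]
rewards = [combined_reward(c, p, s) for c, p, s in group]
advantages = group_relative_advantages(rewards)
```

Under these assumed weights, the second response earns less than the first despite both being correct, because its high confidence is not matched by rationale support. That is the behavior a correctness-only reward cannot express.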