Despite their strengths, large language models (LLMs) often fail to communicate their confidence accurately, making it difficult to assess when they might be wrong and limiting their reliability. In this work, we demonstrate that reasoning models that engage in extended chain-of-thought (CoT) reasoning exhibit superior performance not only in problem-solving but also in accurately expressing their confidence. Specifically, we benchmark six reasoning models across six datasets and find that they achieve strictly better confidence calibration than their non-reasoning counterparts in 33 out of 36 settings. Our detailed analysis reveals that these calibration gains stem from the slow thinking behaviors of reasoning models (e.g., exploring alternative approaches and backtracking), which enable them to adjust their confidence dynamically throughout their CoT, making it progressively more accurate. In particular, we find that reasoning models become increasingly better calibrated as their CoT unfolds, a trend not observed in non-reasoning models. Moreover, removing slow thinking behaviors from the CoT leads to a significant drop in calibration. Lastly, we show that non-reasoning models also exhibit enhanced calibration when simply guided to slow think via in-context learning, fully isolating slow thinking as the source of the calibration gains.
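The abstract does not specify how calibration is scored; as a reading aid, the sketch below shows one standard metric, expected calibration error (ECE), computed from a model's verbalized confidences and answer correctness. The function name, binning scheme, and toy inputs are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def expected_calibration_error(confidences, correctness, n_bins=10):
    """Binned ECE: weighted average |mean confidence - accuracy| over equal-width bins.

    confidences: model-reported confidences in [0, 1].
    correctness: 0/1 indicators (1 if the corresponding answer was correct).
    """
    confidences = np.asarray(confidences, dtype=float)
    correctness = np.asarray(correctness, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins on [0, 1];
    # a confidence of exactly 1.0 is clipped into the top bin.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()
        avg_acc = correctness[mask].mean()
        # Weight each bin's confidence-accuracy gap by its share of predictions.
        ece += mask.mean() * abs(avg_conf - avg_acc)
    return ece

# Toy usage: an overconfident model (high stated confidence, mixed correctness)
# yields a noticeably larger ECE than a better-calibrated one.
print(expected_calibration_error([0.9, 0.95, 0.8, 0.99], [1, 0, 1, 0]))  # poorly calibrated
print(expected_calibration_error([0.6, 0.9, 0.5, 0.95], [1, 1, 0, 1]))   # better calibrated
```

Lower ECE means the stated confidences track empirical accuracy more closely, which is the sense in which "better calibration" is used above.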