Quantifying Faithful Confidence Expression in Large Reasoning Models

Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC), the alignment between a model's intrinsic confidence and its linguistically expressed confidence, remains a persistent failure mode. The problem is especially acute for large reasoning models (LRMs), whose extended reasoning traces users often interpret as evidence of deliberation, competence, and confidence. Despite the importance of FC and the wide use of LRMs, the extent to which LRMs faithfully express their confidence remains poorly understood. Moreover, the prevailing paradigm for measuring FC does not generalize well to the long chain-of-thought outputs of LRMs, which tend to lack clear step boundaries, have inconsistent step structure, and encode complex conditional dependencies throughout the trace, all of which complicate the estimation of intrinsic confidence. To address this challenge, we introduce a novel framework to systematically quantify the FC of LRMs. Our framework analyzes linguistic decisiveness relative to three sources of internal uncertainty: token probabilities, hidden states, and sampled response consistency. We also devise a prefix-conditioned sampling approach to control for conditional and structural variation across traces. Applying our framework to a diverse suite of leading models, datasets, and prompts, we find that faithful confidence expression is a significant challenge for LRMs. Reasoning behaviors do not automatically translate to improved FC, and prompt interventions developed for non-reasoning models do not improve faithfulness in the reasoning setting. Different confidence estimators also produce divergent assessments of the same traces, revealing fragility in prior evaluation methodologies. Taken together, our work establishes FC as a distinct reliability and alignment target for LRMs, particularly as such systems are increasingly deployed in high-stakes contexts.
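
To make the notion of FC concrete, the following minimal Python sketch compares an expressed decisiveness score against two simple intrinsic-confidence proxies. It is an illustration under stated assumptions, not the framework described above: the function names, the majority-agreement estimator, the geometric-mean token probability, and the absolute-difference gap are all hypothetical choices, and the decisiveness score is assumed to be supplied externally (for example by a judge) on a 0 to 1 scale. The hidden-state estimator and the prefix-conditioned sampling step are not shown.

```python
import math
from collections import Counter


def consistency_confidence(answers):
    """Fraction of sampled answers that match the most common answer.

    Assumption: answers are short strings that can be compared after
    trimming whitespace and lowercasing.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


def mean_token_probability(token_logprobs):
    """Geometric mean of token probabilities, computed from log-probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def faithfulness_gap(intrinsic, decisiveness):
    """Absolute gap between intrinsic confidence and expressed decisiveness.

    Both inputs are assumed to lie in [0, 1]; 0 means perfectly faithful
    under this (hypothetical) gap definition.
    """
    return abs(intrinsic - decisiveness)


if __name__ == "__main__":
    samples = ["42", "42 ", "41", "42"]  # hypothetical sampled final answers
    intrinsic = consistency_confidence(samples)
    # 0.95 stands in for a highly decisive phrasing of the final answer.
    print(intrinsic, faithfulness_gap(intrinsic, decisiveness=0.95))
```

In this toy case the sampled answers agree three times out of four while the phrasing is highly decisive, so the gap is nonzero. Substituting `mean_token_probability` as the intrinsic estimate can give a different gap for the same trace, which is the kind of estimator disagreement the abstract reports.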
