Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationship governing this behavior remains poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are each encoded linearly but are orthogonal to one another, a finding that holds across three open-weight models and four datasets. Moreover, when models are prompted to reason through a problem and verbalize a confidence score in the same response, the reasoning process disrupts the verbalized confidence direction, exacerbating miscalibration. We term this the "Reasoning Contamination Effect." Leveraging this insight, we introduce a two-stage adaptive steering pipeline that reads the model's internal accuracy estimate and steers the verbalized output to match it, substantially improving calibration alignment across all evaluated models.
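The geometric picture described above can be illustrated with a minimal sketch on synthetic activations. This is not the paper's implementation. The names `caa_direction`, `cosine`, and `adaptive_steer`, the 64-dimensional activation space, and the planted axes are all hypothetical, and a difference-of-means direction stands in for a trained linear probe. The sketch shows one way to extract two directions from contrastive activation sets, check that they are close to orthogonal, and then steer a hidden state so that its projection on the confidence direction matches the value read from the accuracy direction.

```python
import numpy as np

def caa_direction(pos_acts, neg_acts):
    """Unit-norm difference of mean activations between two contrastive sets."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def cosine(u, v):
    """Cosine similarity; values near 0 indicate near-orthogonal directions."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def adaptive_steer(hidden, accuracy_dir, confidence_dir):
    """Two-stage sketch (assumed form, not the paper's pipeline).

    Stage 1 reads the accuracy signal as a projection onto accuracy_dir.
    Stage 2 shifts the hidden state along confidence_dir so that its
    confidence projection equals the value read in stage 1.
    Both directions are assumed to be unit-norm.
    """
    target = hidden @ accuracy_dir
    current = hidden @ confidence_dir
    return hidden + (target - current) * confidence_dir

# Synthetic data (assumption): the two signals are planted on separate axes.
rng = np.random.default_rng(0)
dim = 64
acc_axis = np.zeros(dim)
acc_axis[0] = 1.0
conf_axis = np.zeros(dim)
conf_axis[1] = 1.0

def noise(n):
    return 0.1 * rng.standard_normal((n, dim))

correct_acts = noise(500) + 2.0 * acc_axis
incorrect_acts = noise(500) - 2.0 * acc_axis
high_conf_acts = noise(500) + 2.0 * conf_axis
low_conf_acts = noise(500) - 2.0 * conf_axis

accuracy_dir = caa_direction(correct_acts, incorrect_acts)
confidence_dir = caa_direction(high_conf_acts, low_conf_acts)
similarity = cosine(accuracy_dir, confidence_dir)

# One hidden state that is internally "likely correct" but verbalizes low confidence.
hidden = 1.5 * acc_axis - 1.0 * conf_axis
steered = adaptive_steer(hidden, accuracy_dir, confidence_dir)
```

In a real model the activations would come from a chosen layer of the residual stream, and the mapping from the internal accuracy estimate to a steering strength would need to be fit on held-out data. The direct projection matching used here is only the simplest possible choice.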