Large language models (LLMs) can increase users' perceived trust by verbalizing confidence in their outputs. However, prior work has shown that LLMs are often overconfident: their stated confidence is unreliable because it does not consistently align with factual accuracy. To better understand the sources of this verbalized confidence, we introduce TracVC (\textbf{Trac}ing \textbf{V}erbalized \textbf{C}onfidence), a method that builds on information retrieval and influence estimation to trace generated confidence expressions back to the training data. We evaluate TracVC on OLMo and Llama models in a question-answering setting and propose a new metric, content groundness, which measures the extent to which an LLM grounds its confidence in content-related training examples (those relevant to the question and answer) rather than in generic examples of confidence verbalization. Our analysis reveals that OLMo2-13B is frequently influenced by confidence-related data that is lexically unrelated to the query, suggesting that it may mimic superficial linguistic expressions of certainty rather than rely on genuine content grounding. These findings point to a fundamental limitation in current training regimes: LLMs may learn how to sound confident without learning when confidence is justified. Our analysis thus provides a foundation for making LLMs' verbalized confidence more reliable and trustworthy.
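As a hedged illustration (the notation here is ours and is not defined in the abstract): for a query $q$, let $\mathcal{T}_k(q)$ denote the $k$ training examples with the highest estimated influence on the generated confidence expression, and let $\mathcal{T}^{\mathrm{c}}_k(q) \subseteq \mathcal{T}_k(q)$ be the subset judged content-related, i.e., relevant to the question and answer. One plausible formalization of content groundness is the fraction
\begin{equation*}
\mathrm{CG}(q) \;=\; \frac{\lvert \mathcal{T}^{\mathrm{c}}_k(q) \rvert}{\lvert \mathcal{T}_k(q) \rvert},
\end{equation*}
where values near $1$ indicate confidence grounded in content and values near $0$ indicate reliance on generic confidence-verbalization examples.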