Despite their impressive capabilities, large language models (LLMs) frequently generate hallucinations. Prior work shows that their internal states encode rich signals of truthfulness, yet the origins and mechanisms of these signals remain unclear. In this paper, we demonstrate that truthfulness cues arise from two distinct information pathways: (1) a Question-Anchored pathway that depends on question-to-answer information flow, and (2) an Answer-Anchored pathway that derives self-contained evidence from the generated answer itself. We first validate and disentangle these pathways through attention knockout and token patching, and then characterize their properties. Further experiments reveal that (1) the two pathways are closely tied to the LLM's knowledge boundaries, and (2) internal representations distinguish between them. Finally, building on these findings, we propose two applications that improve hallucination detection. Overall, our work provides new insight into how LLMs internally encode truthfulness, pointing toward more reliable and self-aware generative systems.
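For readers unfamiliar with attention knockout, the sketch below gives a minimal, hypothetical PyTorch illustration of the core operation (not the paper's implementation; the tensor shapes and token positions are invented for the example): pre-softmax attention scores from chosen target positions to chosen source positions are set to negative infinity, so the corresponding attention weights become zero and, for instance, question-to-answer information flow is severed.

```python
import torch

def knockout_scores(scores: torch.Tensor, query_pos: list[int], key_pos: list[int]) -> torch.Tensor:
    """Block attention from the given query (target) positions to the given
    key (source) positions by setting their pre-softmax scores to -inf,
    so that softmax assigns them zero weight.

    scores: attention scores of shape (..., seq_len, seq_len).
    """
    blocked = scores.clone()
    q_idx = torch.tensor(query_pos).unsqueeze(-1)  # shape (|Q|, 1)
    k_idx = torch.tensor(key_pos).unsqueeze(0)     # shape (1, |K|)
    blocked[..., q_idx, k_idx] = float("-inf")
    return blocked

# Toy example: block answer tokens (positions 5-7) from attending to question
# tokens (positions 0-4), severing the question-to-answer pathway in this matrix.
scores = torch.randn(1, 8, 8)  # (head, seq, seq)
weights = torch.softmax(knockout_scores(scores, [5, 6, 7], [0, 1, 2, 3, 4]), dim=-1)
assert torch.allclose(weights[0, 5:8, 0:5], torch.zeros(3, 5))
```

In practice such a mask would be applied inside the model's attention layers (e.g., via forward hooks) rather than to a standalone score matrix; the sketch only shows the masking step itself.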