Hallucinations in large language models (LLMs) are fluent continuations that are not supported by the prompt, and they arise most readily under minimal contextual cues and ambiguity. We introduce Distributional Semantics Tracing (DST), a model-native method that builds layer-wise semantic maps at the answer position by decoding residual-stream states through the unembedding, selecting a compact top-$K$ concept set, and estimating directed concept-to-concept support via lightweight causal tracing. Using these traces, we test a representation-level hypothesis: hallucinations arise from correlation-driven representational drift across depth, in which the residual stream is pulled toward a locally coherent but context-inconsistent concept neighborhood reinforced by training co-occurrences. On the Racing Thoughts dataset, DST yields more faithful explanations than attribution, probing, and intervention baselines under an LLM-judge protocol, and the resulting Contextual Alignment Score (CAS) strongly predicts these failures, supporting the drift hypothesis.
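To make the decoding step concrete, below is a minimal, hypothetical sketch of layer-wise decoding of residual-stream states through the unembedding at the answer position (a "logit lens"-style readout), assuming a GPT-2-style HuggingFace model. The model name, prompt, value of $K$, and the use of the final layer norm are illustrative assumptions, not the authors' implementation; DST's directed concept-to-concept causal tracing and the CAS computation are not shown.

```python
# Sketch only: layer-wise unembedding readout with top-K concept selection.
# Assumptions: GPT-2 naming (model.transformer.ln_f), tied unembedding weights,
# and K=5; the actual DST pipeline and CAS are not reproduced here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumed stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "The capital of France is"  # hypothetical prompt
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_hidden_states=True)

W_U = model.get_output_embeddings().weight   # unembedding matrix, [vocab, d_model]
ln_f = model.transformer.ln_f                # final layer norm (GPT-2 naming)
K = 5

layerwise_concepts = []
for layer, h in enumerate(out.hidden_states):    # tuple of [1, seq, d_model] tensors
    resid = h[0, -1]                             # residual state at the answer position
    logits = ln_f(resid) @ W_U.T                 # project the state into vocabulary space
    top = torch.topk(logits, K)                  # compact top-K concept set for this layer
    concepts = [tok.decode([i]).strip() for i in top.indices.tolist()]
    layerwise_concepts.append(concepts)
    print(f"layer {layer:2d}: {concepts}")
```

Stacking these per-layer concept sets gives the layer-wise semantic map over which drift toward a context-inconsistent concept neighborhood could, in principle, be traced.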