Large Language Models (LLMs) frequently prioritize conflicting in-context information over pre-existing parametric memory, a phenomenon often termed sycophancy or compliance. However, the mechanistic realization of this behavior remains obscure: how does the model resolve these knowledge conflicts through compliance, and does the suppression of parametric knowledge arise from signal-magnitude dilution or from directional geometric alteration within the residual stream? To resolve this, we conducted a layer-wise geometric analysis across Qwen-4B, Llama-3.1-8B, and GLM-4-9B, decomposing the residual-stream updates induced by counterfactual contexts into radial (norm-based) and angular (cosine-based) components. Our empirical results reject the universality of the "Manifold Dilution" hypothesis: two of the three architectures maintained stable residual norms despite exhibiting significant performance degradation on factual queries. Instead, we observed that compliance is consistently characterized by "Orthogonal Interference," in which the conflicting context injects a steering vector quasi-orthogonal to the ground-truth direction, effectively rotating the hidden-state representation. This suggests that models do not "unlearn" or suppress the magnitude of internal truths; rather, they employ geometric displacement to bypass the correct unembedding vector, simulating adoption while preserving the original representational magnitude. These findings challenge scalar confidence metrics for hallucination detection and underscore the need for vectorial monitoring to distinguish genuine knowledge integration from superficial in-context mimicry.
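The radial/angular decomposition described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the function name, the toy 2-D vectors, and the choice of reference direction (the ground-truth token's unembedding vector) are illustrative assumptions.

```python
import numpy as np

def decompose_update(h_clean, h_conflict, truth_dir):
    """Decompose the residual-stream update induced by a conflicting context
    into radial (norm-based) and angular (cosine-based) components.

    h_clean    -- hidden state at a layer without the counterfactual context
    h_conflict -- hidden state at the same layer with the counterfactual context
    truth_dir  -- reference direction (e.g. the ground-truth unembedding vector)
    """
    delta = h_conflict - h_clean  # steering vector injected by the context

    # Radial component: does the context dilute or inflate the norm?
    norm_ratio = np.linalg.norm(h_conflict) / np.linalg.norm(h_clean)

    # Angular components: is the steering vector orthogonal to the truth
    # direction, and how far has the hidden state rotated?
    cos_delta_truth = (delta @ truth_dir) / (
        np.linalg.norm(delta) * np.linalg.norm(truth_dir)
    )
    cos_rotation = (h_clean @ h_conflict) / (
        np.linalg.norm(h_clean) * np.linalg.norm(h_conflict)
    )
    return norm_ratio, cos_delta_truth, cos_rotation

# Toy example of "Orthogonal Interference": the injected delta is exactly
# orthogonal to the truth direction, so cos_delta_truth is 0 while the
# hidden state rotates away from its original orientation.
h_clean = np.array([1.0, 0.0])
h_conflict = np.array([1.0, 1.0])   # h_clean plus an orthogonal steering vector
truth_dir = np.array([1.0, 0.0])
print(decompose_update(h_clean, h_conflict, truth_dir))
```

Under the hypothesis tested in the paper, a layer exhibiting Orthogonal Interference would show `norm_ratio` near 1 while `cos_delta_truth` stays near 0 and `cos_rotation` drops below 1; a Manifold Dilution signature would instead show `norm_ratio` falling with depth.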