Large language models (LLMs) often generate fluent but factually incorrect statements despite having access to relevant evidence, a failure mode rooted in how they allocate attention between contextual and parametric knowledge. Understanding and steering this internal behavior is key both to trustworthy deployment and to scientific interpretability of model mechanisms. We introduce COMPASS (Context-Modulated PID Attention Steering System), a lightweight, interpretable control framework that embeds a model-based feedback loop directly within decoding. COMPASS quantifies context reliance via a transparent metric, the Context Reliance Score (CRS), which serves as an online probe of how attention heads ground generation in evidence. Using this interpretable signal, a PID controller dynamically modulates attention heads to maintain factual consistency without retraining or multi-pass decoding. Across benchmarks (HotpotQA, XSum, HaluEval, RAGTruth), COMPASS consistently reduces contextual hallucination rates by 2.8 to 5.8 percentage points (absolute) while revealing how distinct attention heads contribute to evidence alignment. These results highlight feedback-driven interpretability as a pathway toward a scientific understanding of LLM behavior.
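To make the control loop concrete, the sketch below shows a minimal PID controller driven by a per-step context-reliance signal, assuming a hypothetical CRS in [0, 1] computed from attention weights and a hypothetical hook that rescales context-attending heads; the names (compute_crs, scale_context_heads, the gain bounds) are illustrative assumptions, not the paper's actual API or tuning.

```python
import numpy as np

class PIDAttentionController:
    """Minimal PID loop over a context-reliance signal (illustrative sketch)."""

    def __init__(self, kp=0.5, ki=0.05, kd=0.1, setpoint=0.7):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint      # target Context Reliance Score (assumed value)
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, crs):
        """Return a multiplicative gain for context-attending heads."""
        error = self.setpoint - crs   # positive when context reliance is too low
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        control = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Map the control signal to a bounded gain around 1.0 (bounds are assumptions).
        return float(np.clip(1.0 + control, 0.5, 2.0))


def compute_crs(attn_weights, context_mask):
    """Hypothetical CRS: fraction of attention mass on context tokens,
    averaged over monitored heads. attn_weights: (heads, tgt_len, src_len)."""
    ctx_mass = attn_weights[..., context_mask].sum(axis=-1)  # (heads, tgt_len)
    return float(ctx_mass.mean())


# Usage sketch inside a decoding loop (model hooks omitted / hypothetical):
# controller = PIDAttentionController()
# for step in range(max_new_tokens):
#     attn = get_monitored_attention()                    # hypothetical hook
#     gain = controller.update(compute_crs(attn, context_mask))
#     scale_context_heads(gain)                           # hypothetical hook
#     token = sample_next_token()
```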