When answering user queries, LLMs often retrieve knowledge from external sources stored in retrieval-augmented generation (RAG) databases. These databases are often populated from unvetted sources, such as the open web, and can contain maliciously crafted data. This paper studies attacks that manipulate the context LLMs retrieve from such RAG databases. Prior work on context manipulation primarily injects false or toxic content, which can often be detected by fact-checking or linguistic analysis. We reveal a more subtle threat, Epistemic Bias Injection (EBI), in which adversaries inject factually correct yet epistemically biased passages that systematically emphasize one side of a multi-viewpoint issue. Although linguistically coherent and truthful, such adversarial passages crowd out alternative viewpoints and steer model outputs toward an attacker-chosen stance. As a core contribution, we propose a novel characterization of the problem: a geometric metric that quantifies epistemic bias and can be computed directly on the embeddings of the text passages retrieved by the LLM. Leveraging this metric, we construct EBI attacks and develop a lightweight prototype defense against them, called BiasDef. We evaluate both the attack and the defense on a comprehensive benchmark constructed from public question answering datasets. Our results show that (1) the proposed attack induces significant perspective shifts while evading existing retrieval-based sanitization defenses, and (2) BiasDef substantially reduces both adversarial retrieval and bias in LLM answers. Overall, these results demonstrate the new threat and show how easily epistemic bias metrics can be used for filtering in RAG-enabled LLMs.
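The abstract does not define the geometric metric, so the following is only an illustrative sketch of what a bias score computed directly on passage embeddings could look like. It is not the paper's metric and not BiasDef. It assumes a hypothetical "stance axis" built from labeled example embeddings for two sides of an issue, and scores a retrieved set by its mean projection onto that axis. The names `stance_axis` and `bias_score`, and the toy 2-D vectors, are all assumptions made for illustration.

```python
import math


def _unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]


def stance_axis(side_a, side_b):
    """Unit vector from the centroid of side_b to the centroid of side_a.

    Hypothetical construction: side_a and side_b are lists of embeddings
    of passages known to argue for each side of the issue.
    """
    dim = len(side_a[0])
    mean_a = [sum(v[i] for v in side_a) / len(side_a) for i in range(dim)]
    mean_b = [sum(v[i] for v in side_b) / len(side_b) for i in range(dim)]
    return _unit([a - b for a, b in zip(mean_a, mean_b)])


def bias_score(retrieved, axis):
    """Mean cosine projection of retrieved embeddings onto the axis.

    The value lies in [-1, 1]. A value near 0 suggests a balanced set;
    a value near +1 or -1 suggests the set leans toward one side.
    """
    projections = [
        sum(p * a for p, a in zip(_unit(v), axis)) for v in retrieved
    ]
    return sum(projections) / len(projections)


# Toy 2-D embeddings (made up): the first coordinate encodes the stance.
axis = stance_axis([[1.0, 0.0]], [[-1.0, 0.0]])
balanced = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
skewed = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

balanced_score = bias_score(balanced, axis)
skewed_score = bias_score(skewed, axis)
```

Under these assumptions, a filter could flag a retrieved set whose score exceeds a chosen threshold in absolute value. The threshold and the choice of axis would need to be validated on real embeddings.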