Retrieval-Augmented Generation (RAG) is a highly effective paradigm for keeping LLM-based responses up to date and reducing the likelihood of hallucinations. Yet RAG was recently shown to be quite vulnerable to corpus knowledge poisoning: an attacker injects misleading documents into the corpus to steer the LLM's output toward an undesired response. We argue that the standard causal attention mechanism in LLMs enables harmful cross-document interactions, specifically under attack. Accordingly, we introduce a novel defense approach for RAG: Sparse Document Attention RAG (SDAG). This is a block-sparse attention mechanism that disallows cross-attention between retrieved documents. SDAG requires only a minimal inference-time change to the attention mask; no fine-tuning or additional architectural changes are needed. We present an empirical evaluation of LLM-based question answering (QA) under a variety of attack strategies on RAG. We show that SDAG substantially outperforms the standard causal attention mechanism in terms of attack success rate. We further demonstrate the clear merits of integrating SDAG with state-of-the-art RAG defense methods: the integration yields performance that is statistically significantly better than the state of the art.
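To make the mechanism concrete, the following is a minimal NumPy sketch of a block-sparse mask in the spirit of SDAG. It is an illustrative assumption, not the paper's implementation: we assume each token carries a segment id (e.g. 0 for the shared instruction/query prompt, 1..k for the k retrieved documents), and the function names (`sdag_mask`, `segment_ids`, `doc_segments`) are hypothetical.

```python
import numpy as np

def sdag_mask(segment_ids, doc_segments):
    """Build a block-sparse causal attention mask (hypothetical sketch).

    segment_ids:  sequence of per-token segment ids, in prompt order
                  (e.g. 0 = shared instruction/query, 1..k = retrieved docs).
    doc_segments: set of segment ids that correspond to retrieved documents.
    Returns an (n, n) boolean array; True means attention is allowed.
    """
    seg = np.asarray(segment_ids)
    n = len(seg)
    # Standard causal mask: token i may attend only to positions j <= i.
    causal = np.tril(np.ones((n, n), dtype=bool))
    allowed = np.ones((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            si, sj = seg[i], seg[j]
            # Disallow cross-document attention: a token inside one
            # retrieved document may not attend to a token inside a
            # *different* retrieved document. Attention to the shared
            # prompt, within a document, and from tokens outside the
            # documents (e.g. the question) remains causal as usual.
            if si in doc_segments and sj in doc_segments and si != sj:
                allowed[i, j] = False
    return causal & allowed
```

Under this sketch, a question placed after the documents still attends to every document (its segment id is not in `doc_segments`), so only the document-to-document blocks of the attention matrix are masked out; applying such a mask is an inference-time change only, matching the abstract's claim that no fine-tuning is needed.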