As Large Language Models (LLMs) are progressively deployed across diverse fields and real-world applications, ensuring their security and robustness has become ever more critical. Retrieval-Augmented Generation (RAG) is a cutting-edge approach designed to address the limitations of LLMs. By retrieving information from a relevant knowledge database, RAG enriches the input to LLMs, enabling them to produce responses that are more accurate and contextually appropriate. Notably, because the knowledge database is sourced from publicly available channels such as Wikipedia, it inevitably introduces a new attack surface. RAG poisoning involves injecting malicious texts into the knowledge database, ultimately leading to the generation of the attacker's target response (also called the poisoned response). However, methods for detecting such poisoning attacks remain limited. We aim to bridge this gap. In particular, we introduce RevPRAG, a flexible and automated detection pipeline that leverages the activations of LLMs for poisoned response detection. Our investigation uncovers distinct patterns in LLMs' activations when generating correct responses versus poisoned responses. Results on multiple benchmark datasets and RAG architectures show that our approach achieves a 98% true positive rate while maintaining a false positive rate close to 1%. We also evaluate recent backdoor detection methods designed for LLMs and applicable to identifying poisoned responses in RAG; the results demonstrate that our approach significantly surpasses them.
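The core idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the activation vectors here are synthetic stand-ins, and the nearest-centroid detector is an assumed toy classifier, used only to show how separable activation patterns for correct versus poisoned responses would enable detection.

```python
import numpy as np

# Hypothetical sketch (not RevPRAG itself): if LLM activations for
# correct vs. poisoned responses form separable clusters, a simple
# classifier over activation vectors can flag poisoned responses.

rng = np.random.default_rng(0)
DIM = 64  # stand-in for an LLM hidden-state dimension

# Synthetic "activations": correct responses cluster around one mean,
# poisoned responses around a shifted mean (assumption for illustration).
correct = rng.normal(loc=0.0, scale=1.0, size=(200, DIM))
poisoned = rng.normal(loc=1.5, scale=1.0, size=(200, DIM))

# Nearest-centroid detector fit on labeled activation vectors.
mu_correct = correct.mean(axis=0)
mu_poisoned = poisoned.mean(axis=0)

def is_poisoned(activation: np.ndarray) -> bool:
    """Flag a response as poisoned if its activation vector lies
    closer to the poisoned-response centroid than to the correct one."""
    d_c = np.linalg.norm(activation - mu_correct)
    d_p = np.linalg.norm(activation - mu_poisoned)
    return d_p < d_c

# Evaluate on held-out synthetic samples.
test_correct = rng.normal(0.0, 1.0, size=(100, DIM))
test_poisoned = rng.normal(1.5, 1.0, size=(100, DIM))
tpr = float(np.mean([is_poisoned(x) for x in test_poisoned]))
fpr = float(np.mean([is_poisoned(x) for x in test_correct]))
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}")
```

In practice, the detection pipeline would extract real hidden-state activations from the LLM at generation time and train a learned classifier on them; the centroid rule above merely makes the separability intuition concrete.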