Large language models (LLMs) integrated with retrieval-augmented generation (RAG) systems improve accuracy by leveraging external knowledge sources. However, recent research has revealed RAG's susceptibility to poisoning attacks, where the attacker injects poisoned texts into the knowledge database, leading to attacker-desired responses. Existing defenses, which predominantly focus on inference-time mitigation, have proven insufficient against sophisticated attacks. In this paper, we introduce RAGForensics, the first traceback system for RAG, designed to identify poisoned texts within the knowledge database that are responsible for the attacks. RAGForensics operates iteratively, first retrieving a subset of texts from the database and then utilizing a specially crafted prompt to guide an LLM in detecting potential poisoning texts. Empirical evaluations across multiple datasets demonstrate the effectiveness of RAGForensics against state-of-the-art poisoning attacks. This work pioneers the traceback of poisoned texts in RAG systems, providing a practical and promising defense mechanism to enhance their security. Our code is available at: https://github.com/zhangbl6618/RAG-Responsibility-Attribution
翻译:大型语言模型(LLM)与检索增强生成(RAG)系统相结合,通过利用外部知识源提高了准确性。然而,近期研究揭示了RAG系统易受投毒攻击的脆弱性,攻击者可将投毒文本注入知识数据库,从而诱导模型生成符合攻击者意图的响应。现有防御方法主要集中于推理阶段的缓解措施,但已被证明难以应对复杂的攻击。本文提出了首个面向RAG系统的溯源框架RAGForensics,旨在识别知识库中引发攻击的投毒文本。RAGForensics采用迭代式操作流程:首先从数据库中检索文本子集,随后通过精心设计的提示词引导LLM检测潜在的投毒文本。在多数据集上的实证评估表明,RAGForensics能有效应对最先进的投毒攻击。本研究开创了RAG系统中投毒文本的溯源机制,为增强系统安全性提供了实用且具有前景的防御方案。代码已开源:https://github.com/zhangbl6618/RAG-Responsibility-Attribution