Retrieval-Augmented Generation (RAG) systems built on Large Language Models (LLMs) have become a core technology for tasks such as question answering (QA) and content generation. RAG poisoning is an attack that injects poisoned documents into a RAG system's database to induce the LLM to generate attacker-specified text. Existing research falls broadly into two classes: white-box methods, which use gradient information to optimize poisoned documents, and black-box methods, which use a pre-trained LLM to generate them. However, white-box methods require knowledge of the RAG system's internal composition and implementation details, while black-box methods cannot exploit interactive feedback from the target system. In this work, we propose RIPRAG, an end-to-end attack pipeline that treats the target RAG system as a black box and leverages our proposed Reinforcement Learning from Black-box Feedback (RLBF) method to optimize the generation model for poisoned documents. We design two kinds of rewards: a similarity reward and an attack reward. Experimental results demonstrate that this method effectively executes poisoning attacks against most complex RAG systems, improving the attack success rate (ASR) by up to 0.72 over baseline methods. This highlights prevalent deficiencies in current defenses and provides critical insights for LLM security research.