Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to boost the capabilities of large language models (LLMs) by incorporating external, up-to-date knowledge sources. However, this introduces a vulnerability to knowledge poisoning attacks, in which attackers compromise the knowledge source to mislead the generation model. One such attack is PoisonedRAG, in which injected adversarial texts steer the model to generate an attacker-chosen response to a target question. In this work, we propose two novel defense methods, FilterRAG and ML-FilterRAG, to mitigate the PoisonedRAG attack. First, we identify a new property that distinguishes adversarial texts from clean texts in the knowledge source. Next, we use this property in the design of both methods to filter adversarial texts out of the knowledge source while retaining the clean ones. Evaluation of these methods on benchmark datasets demonstrates their effectiveness, with performance close to that of the original RAG systems.