Confundo: Learning to Generate Robust Poison for Practical RAG Systems

Retrieval-augmented generation (RAG) is increasingly deployed in real-world applications, where its reference-grounded design makes outputs appear trustworthy. This trust has spurred research on poisoning attacks that craft malicious content, inject it into knowledge sources, and manipulate RAG responses. However, when evaluated in practical RAG systems, existing attacks suffer from severely degraded effectiveness. This gap stems from two overlooked realities: (i) content is often processed before use, which can fragment the poison and weaken its effect, and (ii) users often do not issue the exact queries anticipated during attack design. These factors can lead practitioners to underestimate risks and develop a false sense of security. To better characterize the threat to practical systems, we present Confundo, a learning-to-poison framework that fine-tunes a large language model as a poison generator to achieve high effectiveness, robustness, and stealthiness. Confundo provides a unified framework supporting multiple attack objectives, demonstrated by manipulating factual correctness, inducing biased opinions, and triggering hallucinations. By addressing these overlooked challenges, Confundo consistently outperforms a wide range of purpose-built attacks across datasets and RAG configurations by large margins, even in the presence of defenses. Beyond exposing vulnerabilities, we also present a defensive use case that protects web content from unauthorized incorporation into RAG systems via scraping, with no impact on user experience.
