Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge, making them adaptable and cost-effective for various applications. However, the growing reliance on these systems also introduces potential security risks. In this work, we reveal a novel vulnerability, the retrieval prompt hijack attack (HijackRAG), which enables attackers to manipulate the retrieval mechanisms of RAG systems by injecting malicious texts into the knowledge database. When the RAG system encounters target questions, it generates the attacker's pre-determined answers instead of the correct ones, undermining the integrity and trustworthiness of the system. We formalize HijackRAG as an optimization problem and propose both black-box and white-box attack strategies tailored to different levels of the attacker's knowledge. Extensive experiments on multiple benchmark datasets show that HijackRAG consistently achieves high attack success rates, outperforming existing baseline attacks. Furthermore, we demonstrate that the attack is transferable across different retriever models, underscoring the widespread risk it poses to RAG systems. Lastly, our exploration of various defense mechanisms reveals that they are insufficient to counter HijackRAG, emphasizing the urgent need for more robust security measures to protect RAG systems in real-world deployments.
翻译:检索增强生成(RAG)系统通过整合外部知识来增强大语言模型(LLMs),使其能够适应多种应用场景且具有成本效益。然而,对这些系统日益增长的依赖也带来了潜在的安全风险。本文揭示了一种新型漏洞——检索提示劫持攻击(HijackRAG),该攻击允许攻击者通过向知识库中注入恶意文本来操纵RAG系统的检索机制。当RAG系统遇到目标问题时,它会生成攻击者预先设定的答案而非正确答案,从而破坏系统的完整性和可信度。我们将HijackRAG形式化为一个优化问题,并针对攻击者不同知识水平提出了黑盒与白盒攻击策略。在多个基准数据集上的大量实验表明,HijackRAG始终能实现较高的攻击成功率,优于现有基线攻击方法。此外,我们证明了该攻击在不同检索器模型间具有可迁移性,凸显了其对RAG系统构成的广泛风险。最后,通过对多种防御机制的探索,我们发现现有防御措施均不足以有效抵御HijackRAG,这强调了在实际部署中亟需采取更鲁棒的安全措施来保护RAG系统。