Retrieval-augmented generation (RAG) is vulnerable to prompt injection attacks, in which an adversary inserts malicious documents containing carefully crafted injected prompts into the knowledge database. When a user issues a question targeted by the attack, the RAG system may retrieve these malicious documents, whose injected prompts mislead it into generating attacker-specified answers, thereby compromising the integrity of the RAG system. In this work, we propose CleanBase, a method to detect malicious documents within a knowledge database. Our key insight is that malicious documents crafted for the same attack-targeted questions often exhibit high semantic similarity, as attackers deliberately make them consistent to improve attack success rates. Accordingly, CleanBase constructs a similarity graph over the knowledge database, where each node represents a document and an edge connects two nodes if their semantic similarity (computed using an embedding model) exceeds a statistically determined threshold. Due to their inherent similarity, malicious documents tend to form cliques within this graph. CleanBase detects such cliques and flags the corresponding documents as malicious. We theoretically derive upper bounds on CleanBase's false positive and false negative rates and empirically validate its effectiveness. Experimental results across multiple datasets and prompt injection attacks demonstrate that CleanBase accurately detects malicious documents and effectively safeguards RAG systems. Our source code is available at https://github.com/WeifeiJin/CleanBase.
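The pipeline described in the abstract (similarity graph, threshold, clique detection) can be illustrated with a minimal sketch. This is not the CleanBase implementation. The abstract does not say how the threshold is determined, so the rule used here (mean plus `z` standard deviations of all pairwise similarities) is an assumption, as are the function names, the `min_clique_size` parameter, and the choice of Bron-Kerbosch for clique enumeration. Embeddings are taken as precomputed vectors, and at least three documents are needed for the standard deviation to be defined.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(embeddings, z=2.0):
    """Build an adjacency map linking documents with unusually high similarity.

    Assumed threshold rule (not from the paper): mean + z * std of all
    pairwise similarities.
    """
    n = len(embeddings)
    sims = {(i, j): cosine(embeddings[i], embeddings[j])
            for i, j in combinations(range(n), 2)}
    values = list(sims.values())
    threshold = mean(values) + z * stdev(values)
    adj = {i: set() for i in range(n)}
    for (i, j), s in sims.items():
        if s > threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj, threshold


def maximal_cliques(adj):
    """Enumerate maximal cliques with Bron-Kerbosch (with pivoting)."""
    def expand(r, p, x):
        if not p and not x:
            yield r
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())


def flag_documents(embeddings, min_clique_size=3, z=2.0):
    """Return sorted indices of documents in any sufficiently large clique."""
    adj, _ = build_graph(embeddings, z)
    flagged = set()
    for clique in maximal_cliques(adj):
        if len(clique) >= min_clique_size:
            flagged |= clique
    return sorted(flagged)
```

In this sketch, a group of near-duplicate documents surrounded by mutually dissimilar ones forms a clique and is flagged, while isolated documents are left alone. Enumerating maximal cliques is exponential in the worst case, so a real knowledge database would need a more scalable strategy than this illustration.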