Retrieval-Augmented Generation (RAG) integrates external knowledge into large language models to improve response quality. However, recent work has shown that RAG systems are highly vulnerable to poisoning attacks, where malicious texts are inserted into the knowledge database to influence model outputs. While several defenses have been proposed, they are often circumvented by more adaptive or sophisticated attacks. This paper presents RAGOrigin, a black-box responsibility attribution framework designed to identify which texts in the knowledge database are responsible for misleading or incorrect generations. Our method constructs a focused attribution scope tailored to each misgeneration event and assigns a responsibility score to each candidate text by evaluating its retrieval ranking, semantic relevance, and influence on the generated response. The system then isolates poisoned texts using an unsupervised clustering method. We evaluate RAGOrigin across seven datasets and fifteen poisoning attacks, including newly developed adaptive poisoning strategies and multi-attacker scenarios. Our approach outperforms existing baselines in identifying poisoned content and remains robust under dynamic and noisy conditions. These results suggest that RAGOrigin provides a practical and effective solution for tracing the origins of corrupted knowledge in RAG systems. Our code is available at: https://github.com/zhangbl6618/RAG-Responsibility-Attribution
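The scoring-and-clustering pipeline described above can be illustrated with a minimal sketch. This is not the paper's implementation: the aggregation weights, the inverse-rank transform, and the simple 1-D two-means split are all illustrative assumptions standing in for RAGOrigin's actual responsibility score and unsupervised clustering step.

```python
import numpy as np

def responsibility_scores(rank, relevance, influence, weights=(1.0, 1.0, 1.0)):
    """Hypothetical aggregation of the three signals named in the abstract.

    rank:      1-based retrieval rank per candidate text (lower = more prominent),
               inverted so that top-ranked texts contribute more.
    relevance: semantic relevance to the misgeneration query, in [0, 1].
    influence: estimated influence on the generated response, in [0, 1].
    """
    r = 1.0 / np.asarray(rank, dtype=float)
    rel = np.asarray(relevance, dtype=float)
    inf = np.asarray(influence, dtype=float)
    w1, w2, w3 = weights
    return w1 * r + w2 * rel + w3 * inf

def split_two_clusters(scores, iters=50):
    """Toy unsupervised step: 1-D two-means over responsibility scores.

    Returns a boolean mask flagging the high-score cluster as suspected
    poisoned texts; the low-score cluster is treated as benign.
    """
    s = np.asarray(scores, dtype=float)
    lo, hi = s.min(), s.max()  # initialize centroids at the extremes
    for _ in range(iters):
        assign = np.abs(s - hi) < np.abs(s - lo)  # True -> high cluster
        new_hi = s[assign].mean() if assign.any() else hi
        new_lo = s[~assign].mean() if (~assign).any() else lo
        if new_hi == hi and new_lo == lo:  # converged
            break
        hi, lo = new_hi, new_lo
    return assign

# Toy usage: three prominent, relevant, influential texts vs. two benign ones.
scores = responsibility_scores(
    rank=[1, 2, 3, 8, 9],
    relevance=[0.9, 0.85, 0.8, 0.2, 0.1],
    influence=[0.9, 0.8, 0.7, 0.1, 0.05],
)
flags = split_two_clusters(scores)
print(flags)  # the first three candidates fall in the high-score cluster
```

In this toy setting the clustering needs no labeled poisoned examples, mirroring the unsupervised isolation step the abstract describes; the real system operates in a black-box setting against the RAG pipeline rather than on hand-supplied signal values.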