Large language models (LLMs) perform strongly across Natural Language Processing (NLP) tasks, yet they raise concerns about potential security threats. Early backdoor attacks verified that LLMs can do substantial harm at all stages, but the cost and robustness of these attacks have been criticized. Attacking an LLM directly is inherently risky under security review and prohibitively expensive. Moreover, the continuous iteration of LLMs degrades the robustness of backdoors. In this paper, we propose TrojanRAG, which mounts a joint backdoor attack on Retrieval-Augmented Generation (RAG), thereby manipulating LLMs in universal attack scenarios. Specifically, the adversary constructs elaborate target contexts and trigger sets. Multiple pairs of backdoor shortcuts are orthogonally optimized by contrastive learning, which constrains the triggering conditions to a parameter subspace and improves matching. To improve the RAG's recall of the target contexts, we introduce a knowledge graph to construct structured data and achieve hard matching at a fine-grained level. Moreover, we normalize the backdoor scenarios in LLMs to analyze the real harm caused by backdoors from both the attacker's and the user's perspectives, and we further verify whether context is a favorable tool for jailbreaking models. Extensive experiments on truthfulness, language understanding, and harmfulness show that TrojanRAG poses versatile threats while maintaining retrieval capability on normal queries.