Large language models (LLMs) have raised concerns about potential security threats despite their strong performance in Natural Language Processing (NLP). Early backdoor attacks verified that LLMs can cause substantial harm at all stages, but their cost and robustness have been criticized. Attacking LLMs directly is inherently risky under security review and prohibitively expensive. Moreover, the continuous iteration of LLMs degrades the robustness of backdoors. In this paper, we propose TrojanRAG, which mounts a joint backdoor attack on Retrieval-Augmented Generation (RAG), thereby manipulating LLMs in universal attack scenarios. Specifically, the adversary constructs elaborate target contexts and trigger sets. Multiple pairs of backdoor shortcuts are orthogonally optimized by contrastive learning, which constrains the triggering conditions to a parameter subspace and improves matching. To improve the recall of RAG for the target contexts, we introduce a knowledge graph to construct structured data and achieve hard matching at a fine-grained level. Moreover, we normalize the backdoor scenarios in LLMs to analyze the real harm caused by backdoors from both the attacker's and the user's perspectives, and we further verify whether the context is a favorable tool for jailbreaking models. Extensive experimental results on truthfulness, language understanding, and harmfulness show that TrojanRAG poses versatile threats while maintaining retrieval capabilities on normal queries.