Large language models (LLMs) have advanced rapidly, achieving strong performance on a wide range of Natural Language Processing (NLP) tasks, from understanding to reasoning. However, they remain vulnerable to backdoor attacks, in which a model behaves normally on standard queries but produces harmful or unintended outputs when specific triggers are activated. Existing backdoor defenses suffer from notable drawbacks: they focus on detection without removal, rely on rigid assumptions about trigger properties, or prove ineffective against advanced attacks such as multi-trigger backdoors. In this paper, we present a novel method that eliminates backdoor behaviors from LLMs by constructing information conflicts through both internal and external mechanisms. Internally, we train a conflict model on a lightweight dataset and merge it with the backdoored model, neutralizing malicious behaviors by embedding contradictory information in the model's parametric memory. Externally, we incorporate convincing contradictory evidence into the prompt to challenge the model's internal backdoor knowledge. Experiments on classification and conversational tasks across four widely used LLMs show that our method outperforms eight state-of-the-art backdoor defense baselines, reducing the attack success rate of advanced backdoor attacks by up to 98% while maintaining over 90% clean-data accuracy. Furthermore, our method proves robust against adaptive backdoor attacks. The code will be open-sourced upon publication.