Large language models (LLMs) have advanced rapidly, achieving strong performance on a wide range of Natural Language Processing (NLP) tasks, from understanding to reasoning. However, they remain vulnerable to backdoor attacks, in which a model behaves normally on standard queries but produces harmful or unintended outputs when specific triggers are activated. Existing backdoor defenses suffer from notable drawbacks: they focus on detection without removal, rely on rigid assumptions about trigger properties, or prove ineffective against advanced attacks such as multi-trigger backdoors. In this paper, we present a novel method that eliminates backdoor behaviors from LLMs by constructing information conflicts through both internal and external mechanisms. Internally, we train a conflict model on a lightweight dataset and merge it with the backdoored model, neutralizing malicious behaviors by embedding contradictory information in the model's parametric memory. Externally, we incorporate convincing contradictory evidence into the prompt to challenge the model's internal backdoor knowledge. Experiments on classification and conversational tasks across four widely used LLMs show that our method outperforms eight state-of-the-art backdoor defense baselines, reducing the attack success rate of advanced backdoor attacks by up to 98% while maintaining over 90% clean-data accuracy. Furthermore, our method proves robust against adaptive backdoor attacks. The code will be open-sourced upon publication.