Backdoor attacks pose a serious threat to the safety and reliability of Large Language Models (LLMs): a backdoored model behaves normally on clean inputs but produces attacker-specified responses when a hidden trigger is present. Removing such a backdoor is particularly challenging when the defender knows neither the attack type nor the internal mechanisms formed during backdoor training. In this work, we propose a simple but effective backdoor removal method that exploits internal mechanisms shared across different backdoors. First, we show that different backdoors with the same task (attack objective) induce similar trigger-activated changes in the model's internal activations. Motivated by this observation, our method intentionally embeds a backdoor with a known trigger (\emph{dummy backdoor}) and then removes it through further fine-tuning on dummy-triggered inputs paired with clean responses. Because the dummy backdoor and the unknown backdoor can rely on shared internal mechanisms, removing the dummy backdoor also reduces the effect of the unknown backdoor. We evaluate our method on three backdoor attack types across multiple model families. Experimental results show that our method substantially reduces the attack success rate of the unknown backdoor while preserving model utility, outperforming representative existing defense methods in both backdoor removal effectiveness and utility preservation. These findings suggest that a defender-controllable backdoor can serve as a helpful proxy for mitigating unknown backdoors in generative LLMs.
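The two-stage procedure described above (embed a dummy backdoor, then fine-tune it away) can be illustrated at the data level. The following is a minimal sketch under stated assumptions, not the authors' implementation: the trigger string `[DUMMY]`, its placement as a prompt prefix, the mixing of clean pairs into the embedding set, and the substring-based attack success rate are all hypothetical choices made for illustration, and the fine-tuning step itself is omitted.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (prompt, response)

# Hypothetical defender-chosen trigger; the abstract does not specify one.
DUMMY_TRIGGER = "[DUMMY]"


def add_trigger(prompt: str, trigger: str = DUMMY_TRIGGER) -> str:
    """Prepend the trigger to a prompt (prefix placement is an assumption)."""
    return f"{trigger} {prompt}"


def build_embedding_data(
    clean_pairs: List[Pair], target_response: str, trigger: str = DUMMY_TRIGGER
) -> List[Pair]:
    """Stage 1 data: teach the model the dummy backdoor.

    Triggered prompts map to the attack-style target response; the clean
    pairs are kept alongside them (an assumption) so normal behavior persists.
    """
    poisoned = [(add_trigger(p, trigger), target_response) for p, _ in clean_pairs]
    return list(clean_pairs) + poisoned


def build_removal_data(
    clean_pairs: List[Pair], trigger: str = DUMMY_TRIGGER
) -> List[Pair]:
    """Stage 2 data: dummy-triggered inputs paired with clean responses."""
    return [(add_trigger(p, trigger), r) for p, r in clean_pairs]


def attack_success_rate(responses: List[str], target_response: str) -> float:
    """Fraction of responses containing the target (a simplified ASR proxy)."""
    if not responses:
        return 0.0
    hits = sum(target_response in r for r in responses)
    return hits / len(responses)
```

In use, a defender would fine-tune the suspect model on the output of `build_embedding_data`, then on the output of `build_removal_data`, and finally compare `attack_success_rate` on inputs carrying the unknown trigger before and after the second stage.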