Current backdoor defenses assume that neutralizing a known trigger removes the backdoor. We show that this trigger-centric view is incomplete: \emph{alternative triggers}, patterns perceptually distinct from the training triggers, reliably activate the same backdoor. We estimate the backdoor direction in feature space by contrasting clean and triggered representations, and develop a feature-guided attack that jointly optimizes target prediction and alignment with this direction. We prove theoretically that alternative triggers exist and are an inevitable consequence of backdoor training, and we verify this empirically. Moreover, defenses that remove training triggers often leave the backdoor intact, allowing alternative triggers to exploit the latent backdoor in feature space. Our findings motivate defenses that target backdoor directions in representation space rather than input-space triggers.
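
As an illustrative sketch only (the notation is assumed here and is not taken from the paper), let $f$ denote the feature extractor, $h$ the classification head, $T$ the training trigger, and $x_1,\dots,x_n$ clean inputs. One simple way to contrast clean and triggered representations is a normalized mean difference,
\begin{equation}
  \hat{d} = \frac{\bar{d}}{\|\bar{d}\|_2},
  \qquad
  \bar{d} = \frac{1}{n}\sum_{i=1}^{n}\bigl(f(T(x_i)) - f(x_i)\bigr),
\end{equation}
and a feature-guided objective for an alternative trigger $\delta$ with target label $y_t$ could then combine a classification loss with a directional alignment term,
\begin{equation}
  \min_{\delta}\;
  \mathcal{L}_{\mathrm{CE}}\bigl(h(f(x+\delta)),\, y_t\bigr)
  - \lambda \cos\bigl(f(x+\delta) - f(x),\, \hat{d}\bigr),
\end{equation}
where $\cos(u,v) = u^{\top}v / (\|u\|_2\,\|v\|_2)$. The weight $\lambda > 0$, the additive form $x+\delta$, and the choice of cosine similarity are hypothetical choices made for this sketch.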