Backdoor attacks pose a serious threat to deep neural networks (DNNs), allowing adversaries to implant triggers that activate hidden behaviors at inference time. Defending against such vulnerabilities is especially difficult in the post-training setting, because end users lack both the training data and prior knowledge of the attacks. Model merging offers a cost-effective defense. However, recent methods such as weight averaging (WAG) provide reasonable protection when multiple homologous models are available, but they become less effective with fewer models and place heavy demands on defenders. We propose a module-switching defense (MSD) that disrupts backdoor shortcuts. We first validate its theoretical rationale and empirical effectiveness on two-layer networks, showing that it achieves higher backdoor divergence than WAG while preserving utility. For deep models, we evaluate MSD on Transformer and CNN architectures and design an evolutionary algorithm that optimizes fusion strategies, with selective mechanisms for identifying the most effective combinations. Experiments show that MSD achieves stronger defense with fewer models in practical settings. Even in the underexplored case of collusive attacks, where some of the models share the same backdoors, the switching strategies found by MSD deliver superior robustness against diverse attacks. Code is available at https://github.com/weijun-l/module-switching-defense.
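To make the contrast between weight averaging and module switching concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each model is a toy dictionary mapping module names to flat weight lists. The names `weight_average`, `module_switch`, and `assignment`, as well as the module names, are hypothetical. The evolutionary search that selects which model supplies which module is not reproduced here; the assignment is fixed by hand.

```python
from typing import Dict, List, Sequence

# Toy stand-in for a network: module name -> flattened weights (an assumption).
Model = Dict[str, List[float]]


def weight_average(models: Sequence[Model]) -> Model:
    """Element-wise mean of every module across all models (WAG-style merge)."""
    merged: Model = {}
    for name in models[0]:
        # Group the i-th weight of this module from every model together.
        columns = zip(*(m[name] for m in models))
        merged[name] = [sum(col) / len(models) for col in columns]
    return merged


def module_switch(models: Sequence[Model], assignment: Dict[str, int]) -> Model:
    """Copy each module whole from the model index given in `assignment`."""
    return {name: list(models[idx][name]) for name, idx in assignment.items()}


# Two hypothetical homologous models with the same module layout.
model_a: Model = {"layer0.attn": [1.0, 2.0], "layer0.mlp": [3.0, 4.0]}
model_b: Model = {"layer0.attn": [5.0, 6.0], "layer0.mlp": [7.0, 8.0]}

averaged = weight_average([model_a, model_b])
# Hand-picked assignment: attention from model 0, MLP from model 1.
switched = module_switch(
    [model_a, model_b], {"layer0.attn": 0, "layer0.mlp": 1}
)

print(averaged)
print(switched)
```

Averaging blends every parameter from all models, whereas switching keeps each module intact but draws neighboring modules from different source models. Under the abstract's framing, that interleaving is what is intended to disrupt a backdoor shortcut that spans several modules of one model.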