Existing studies reveal that current backdoor defenses exhibit limited robustness and often fail against specific types of attacks. More worryingly, prevailing safety tuning strategies tend to provide only superficial protection, as they fall short of completely eliminating backdoor effects. In this work, we present a novel formulation of backdoor learning and unlearning as a sequential, three-stage process from a continual learning perspective. Within this framework, we formally define complete backdoor unlearning and derive the necessary conditions for achieving it based on the mechanism of catastrophic forgetting. Guided by these insights, we propose Blind Inversion-Backdoor Adversarial Unlearning (BI-BAU), which formulates the generation of adversarial examples satisfying the unlearning conditions as a blind inversion problem. We solve this problem by integrating the bi-level optimization process of adversarial training into an Expectation-Maximization (EM) algorithm framework to optimize the maximum a posteriori (MAP) objective. Furthermore, we extend BI-BAU to untargeted adversarial scenarios with unknown target classes, as well as to multi-modal contrastive learning tasks, enhancing its applicability to real-world deployment scenarios where pre-trained models may be compromised. Extensive experiments demonstrate that our method applies across a wide spectrum of backdoor attacks and can effectively and thoroughly eliminate backdoor effects from a backdoored model.
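For orientation, the two standard building blocks named in the abstract can be written in their generic textbook forms. This is only a sketch under assumed notation: the symbols $f_\theta$, $\ell$, $\delta_i$, $\epsilon$, $z$, and $\mathcal{D}$ are not taken from the abstract, and the specific way BI-BAU couples the two components (including its unlearning conditions and the latent quantity recovered by blind inversion) is not specified here.

```latex
% Assumed notation: f_\theta is the model, \ell a classification loss,
% \mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n} a small clean dataset,
% \delta_i a norm-bounded perturbation, z a generic latent variable.

% (1) Generic bi-level (min-max) objective of adversarial training:
%     the inner problem finds a worst-case perturbation, the outer
%     problem updates the model against it.
\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n}
  \max_{\|\delta_i\|_p \le \epsilon}
  \ell\bigl(f_\theta(x_i + \delta_i),\, y_i\bigr)

% (2) Generic EM iteration for a MAP objective with latent variable z.
% E-step: expected complete-data log-likelihood under the current posterior.
Q\bigl(\theta \mid \theta^{(t)}\bigr)
  = \mathbb{E}_{z \sim p(z \mid \mathcal{D},\, \theta^{(t)})}
    \bigl[\log p(\mathcal{D}, z \mid \theta)\bigr]

% M-step: maximize the expectation plus the log-prior on \theta.
\theta^{(t+1)}
  = \arg\max_{\theta} \; Q\bigl(\theta \mid \theta^{(t)}\bigr) + \log p(\theta)
```

Read this way, one plausible interpretation is that the inner maximization of (1) plays the role of estimating the unknown adversarial quantity while the outer minimization updates the model parameters, but that mapping is an assumption rather than a statement of the authors' formulation.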