Backdoor attacks on deep learning are a recent threat that has gained significant attention in the research community. Backdoor defenses rely mainly on backdoor inversion, which has been shown to be generic, model-agnostic, and applicable to practical threat scenarios. State-of-the-art backdoor inversion recovers a mask in the feature space to locate prominent backdoor features, where benign and backdoor features can be disentangled. However, it suffers from high computational overhead, and we find that it relies too heavily on prominent backdoor features that are highly distinguishable from benign features. To address these shortcomings, this paper improves backdoor feature inversion for backdoor detection by incorporating extra neuron activation information. In particular, we adversarially perturb the weights of a model to increase its loss and thereby activate the backdoor effect, which allows us to easily differentiate backdoored from clean models. Experimental results demonstrate that our defense, BAN, is 1.37$\times$ (on CIFAR-10) and 5.11$\times$ (on ImageNet200) more efficient, with an average 9.99\% higher detection success rate than the state-of-the-art defense BTI-DBF. Our code and trained models are publicly available at~\url{https://github.com/xiaoyunxxy/ban}.
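The core detection signal described above, an adversarial ascent step on the model weights that increases the loss, can be illustrated with a minimal sketch. This is an assumption of how such a step could look on a toy linear model with NumPy, not the authors' BAN implementation; the perturbation budget `eps` and all variable names are hypothetical.

```python
import numpy as np

# Toy setup: a tiny linear model with loss(w) = mean((X @ w - y)^2).
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
w = rng.normal(size=8)
y = X @ w + 0.1 * rng.normal(size=64)  # noisy targets

def loss(w):
    r = X @ w - y
    return np.mean(r ** 2)

def grad(w):
    # Analytic gradient: d/dw mean((Xw - y)^2) = (2/n) * X^T (Xw - y)
    return 2.0 * X.T @ (X @ w - y) / len(y)

# FGSM-style ascent step applied to the *weights* (not the inputs):
# move each weight in the sign direction of the loss gradient.
eps = 0.05  # perturbation budget (hypothetical)
w_adv = w + eps * np.sign(grad(w))

# The loss increase under perturbed weights is the detection signal:
# a backdoored model is expected to show a larger jump than a clean one.
delta = loss(w_adv) - loss(w)
print(delta > 0)
```

In a real deep model the gradient would come from autograd and the loss gap would be compared between suspect and known-clean models; the sketch only shows that a signed gradient step on the weights provably raises this convex loss.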