Backdoor attacks on deep learning models are a recent threat that has attracted significant attention in the research community. Backdoor defenses are mainly based on backdoor inversion, which has been shown to be generic, model-agnostic, and applicable to practical threat scenarios. State-of-the-art backdoor inversion recovers a mask in the feature space to locate prominent backdoor features, where benign and backdoor features can be disentangled. However, this approach suffers from high computational overhead, and we find that it relies too heavily on prominent backdoor features that are highly distinguishable from benign features. To address these shortcomings, this paper improves backdoor feature inversion for backdoor detection by incorporating extra neuron activation information. In particular, we adversarially increase the loss of backdoored models with respect to their weights to activate the backdoor effect, based on which we can easily differentiate backdoored and clean models. Experimental results demonstrate that our defense, BAN, is 1.37$\times$ (on CIFAR-10) and 5.11$\times$ (on ImageNet200) more efficient, with a 9.99% higher detection success rate, than the state-of-the-art defense BTI-DBF. Our code and trained models are publicly available at \url{https://anonymous.4open.science/r/ban-4B32}.
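The core step described above, adversarially perturbing a model's weights to maximize its loss, can be illustrated with a minimal sketch. The code below is a hypothetical NumPy illustration on a linear softmax classifier, not the authors' implementation: `perturb_weights` runs normalized gradient ascent on the cross-entropy loss with respect to the weights, projected onto an $\ell_2$ ball around the original weights. The function names, the ball radius `eps`, and the step schedule are all illustrative assumptions.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class dimension.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grad(W, X, y):
    """Cross-entropy loss of a linear softmax classifier and its gradient w.r.t. W."""
    n = X.shape[0]
    p = softmax(X @ W)                          # (n, k) class probabilities
    loss = -np.log(p[np.arange(n), y] + 1e-12).mean()
    p[np.arange(n), y] -= 1.0                   # dL/dlogits
    grad = X.T @ p / n                          # (d, k) gradient w.r.t. weights
    return loss, grad

def perturb_weights(W, X, y, eps=0.5, steps=10, lr=0.2):
    """Gradient *ascent* on the loss w.r.t. the weights, keeping the
    perturbation inside an L2 ball of radius eps around the original W."""
    W_adv = W.copy()
    for _ in range(steps):
        _, g = loss_and_grad(W_adv, X, y)
        W_adv = W_adv + lr * g / (np.linalg.norm(g) + 1e-12)
        delta = W_adv - W
        norm = np.linalg.norm(delta)
        if norm > eps:                          # project back onto the eps-ball
            W_adv = W + delta * (eps / norm)
    return W_adv

# Toy demo: loss strictly increases under the adversarial weight perturbation.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
y = rng.integers(0, 3, size=64)
W = rng.normal(scale=0.1, size=(8, 3))
loss_before, _ = loss_and_grad(W, X, y)
W_adv = perturb_weights(W, X, y)
loss_after, _ = loss_and_grad(W_adv, X, y)
```

In the paper's setting, the statistic of interest would be how the model's behavior changes under such a bounded weight perturbation, with the hypothesis that backdoored models react differently from clean ones; the sketch only shows the loss-maximization step itself.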