Deep neural networks (DNNs) have been found to be vulnerable to backdoor attacks, raising security concerns about their deployment in mission-critical applications. While existing defense methods have demonstrated promising results, it remains unclear how to effectively remove backdoor-associated neurons from backdoored DNNs. In this paper, we propose a novel defense called \emph{Reconstructive Neuron Pruning} (RNP) that exposes and prunes backdoor neurons via an unlearn-then-recover process. Specifically, RNP first unlearns the neurons by maximizing the model's error on a small subset of clean samples and then recovers the neurons by minimizing the model's error on the same data. In RNP, unlearning operates at the neuron level while recovery operates at the filter level, forming an asymmetric reconstructive learning procedure. We show that this asymmetric process, run on only a few clean samples, can effectively expose and prune the backdoor neurons implanted by a wide range of attacks, achieving new state-of-the-art defense performance. Moreover, the unlearned model obtained at the intermediate step of RNP can be directly used to improve other backdoor defense tasks, including backdoor removal, trigger recovery, backdoor label detection, and backdoor sample detection. Code is available at \url{https://github.com/bboylyg/RNP}.
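The unlearn-then-recover control flow described above can be illustrated with a minimal pure-Python sketch. It is a toy under stated assumptions and not the released RNP implementation: the model is a masked linear scorer with three hypothetical "filters", the data are four made-up clean samples, "neuron level" is read here as updating every individual weight while "filter level" is read as learning one mask value per filter, and the step sizes, step counts, and pruning threshold are arbitrary illustrative choices. The sketch only shows the mechanics (gradient ascent on the clean loss, then gradient descent on per-filter masks with the unlearned weights frozen, then thresholding); it contains no backdoor and makes no claim about which filters a real defense would prune.

```python
import math

def filter_outputs(weights, x):
    """One scalar per filter: dot product of that filter's weights with x."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]

def score(weights, masks, x):
    """Model logit: mask-weighted sum of the filter outputs."""
    return sum(m * f for m, f in zip(masks, filter_outputs(weights, x)))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def clean_loss(weights, masks, data):
    """Mean binary cross-entropy on the clean samples."""
    total = 0.0
    for x, y in data:
        z = score(weights, masks, x)
        t = z if y == 1 else -z  # margin: positive when the prediction is correct
        total += max(-t, 0.0) + math.log1p(math.exp(-abs(t)))  # log(1 + exp(-t))
    return total / len(data)

def unlearn(weights, masks, data, lr, steps):
    """Gradient ASCENT on every individual weight (assumed 'neuron level')."""
    weights = [list(w) for w in weights]  # work on a copy
    for _ in range(steps):
        grads = [[0.0] * len(w) for w in weights]
        for x, y in data:
            err = sigmoid(score(weights, masks, x)) - y  # d(loss)/d(logit)
            for k, m in enumerate(masks):
                for i, x_i in enumerate(x):
                    grads[k][i] += err * m * x_i / len(data)
        for k, g_k in enumerate(grads):
            for i, g in enumerate(g_k):
                weights[k][i] += lr * g  # plus sign: maximize the clean loss
    return weights

def recover(weights, data, lr, steps):
    """Gradient DESCENT on one mask per filter, weights frozen (assumed 'filter level')."""
    masks = [1.0] * len(weights)
    for _ in range(steps):
        grads = [0.0] * len(masks)
        for x, y in data:
            err = sigmoid(score(weights, masks, x)) - y
            for k, f in enumerate(filter_outputs(weights, x)):
                grads[k] += err * f / len(data)
        # minus sign: minimize the clean loss; masks are clipped to [0, 1]
        masks = [min(1.0, max(0.0, m - lr * g)) for m, g in zip(masks, grads)]
    return masks

# Hypothetical toy setup: label is 1 when the first feature dominates.
clean = [([2.0, 0.0], 1), ([1.0, 0.5], 1), ([0.0, 2.0], 0), ([0.5, 1.0], 0)]
weights = [[1.0, -1.0], [0.5, -0.5], [0.2, 0.1]]  # three "filters"
ones = [1.0] * len(weights)
THRESHOLD = 0.2  # illustrative pruning threshold, not taken from the paper

loss_initial = clean_loss(weights, ones, clean)
unlearned = unlearn(weights, ones, clean, lr=0.5, steps=10)
loss_unlearned = clean_loss(unlearned, ones, clean)
masks = recover(unlearned, clean, lr=0.2, steps=200)
loss_recovered = clean_loss(unlearned, masks, clean)
pruned = [k for k, m in enumerate(masks) if m < THRESHOLD]

print("clean loss:", loss_initial, "->", loss_unlearned, "->", loss_recovered)
print("masks:", masks, "pruned filters:", pruned)
```

In this toy the two phases are deliberately asymmetric: `unlearn` moves each weight independently, whereas `recover` can only rescale whole filters, so it cannot simply undo the unlearning step and must instead decide which filters to suppress. For a real convolutional network one would replace the hand-written gradients with an autodiff framework and attach one mask per convolutional filter.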