Deep neural networks (DNNs) have been found to be vulnerable to backdoor attacks, raising security concerns about their deployment in mission-critical applications. While existing defense methods have demonstrated promising results, it remains unclear how to effectively remove backdoor-associated neurons from backdoored DNNs. In this paper, we propose a novel defense called \emph{Reconstructive Neuron Pruning} (RNP) that exposes and prunes backdoor neurons via an unlearn-then-recover process. Specifically, RNP first unlearns the neurons by maximizing the model's error on a small subset of clean samples, and then recovers the neurons by minimizing the model's error on the same data. In RNP, unlearning operates at the neuron level while recovery operates at the filter level, forming an asymmetric reconstructive learning procedure. We show that this asymmetric process, using only a few clean samples, can effectively expose and prune the backdoor neurons implanted by a wide range of attacks, achieving new state-of-the-art defense performance. Moreover, the unlearned model produced at the intermediate step of RNP can be used directly to improve other backdoor defense tasks, including backdoor removal, trigger recovery, backdoor label detection, and backdoor sample detection. Code is available at \url{https://github.com/bboylyg/RNP}.
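The unlearn-then-recover procedure described above can be sketched on a toy model. The following is a minimal illustration in plain Python, not the authors' implementation (the linked repository holds that). It assumes a linear scorer built from two hypothetical "filters", a logistic loss, plain gradient steps, masks clipped to [0, 1], and a pruning threshold of 0.5; all function names, hyperparameters, and data are invented for the example. The toy contains no backdoor, so it only shows the mechanics of the asymmetric update (ascent on every individual weight, descent on one mask per filter) and says nothing about which neurons a real attack would expose.

```python
import math


def logit(filters, masks, x):
    """Score = sum over filters of mask_f * (filter weights . filter inputs)."""
    return sum(m * sum(w * xi for w, xi in zip(ws, xs))
               for m, ws, xs in zip(masks, filters, x))


def clean_loss(filters, masks, data):
    """Mean logistic loss on (inputs, label) pairs with labels in {-1, +1}."""
    return sum(math.log1p(math.exp(-y * logit(filters, masks, x)))
               for x, y in data) / len(data)


def unlearn(filters, data, lr=0.5, steps=10):
    """Stage 1 (assumed form): gradient ASCENT on every weight, masks fixed at 1."""
    filters = [list(ws) for ws in filters]
    ones = [1.0] * len(filters)
    for _step in range(steps):
        grads = [[0.0] * len(ws) for ws in filters]
        for x, y in data:
            # Derivative of the per-sample loss with respect to the logit.
            dz = -y / (1.0 + math.exp(y * logit(filters, ones, x)))
            for f, xs in enumerate(x):
                for j, xi in enumerate(xs):
                    grads[f][j] += dz * xi / len(data)
        # Step uphill: this maximizes the error on the clean samples.
        filters = [[w + lr * g for w, g in zip(ws, gs)]
                   for ws, gs in zip(filters, grads)]
    return filters


def recover(filters, data, lr=0.1, steps=50):
    """Stage 2 (assumed form): gradient DESCENT on one mask per filter, weights frozen."""
    masks = [1.0] * len(filters)
    for _step in range(steps):
        grads = [0.0] * len(filters)
        for x, y in data:
            dz = -y / (1.0 + math.exp(y * logit(filters, masks, x)))
            for f, (ws, xs) in enumerate(zip(filters, x)):
                grads[f] += dz * sum(w * xi for w, xi in zip(ws, xs)) / len(data)
        # Step downhill, then clip each mask back into [0, 1].
        masks = [min(1.0, max(0.0, m - lr * g)) for m, g in zip(masks, grads)]
    return masks


def low_mask_filters(masks, threshold=0.5):
    """Indices of filters whose recovered mask falls below the (assumed) threshold."""
    return [f for f, m in enumerate(masks) if m < threshold]


# Hypothetical clean samples: inputs are grouped per filter, labels are +/-1.
clean_data = [
    ([[1.0, 0.5], [0.2, 0.0]], +1),
    ([[0.8, 0.6], [-0.1, 0.1]], +1),
    ([[-1.0, -0.4], [0.1, 0.0]], -1),
    ([[-0.9, -0.5], [-0.2, -0.1]], -1),
]
initial = [[0.6, 0.4], [0.5, 0.3]]  # two filters, two weights each

unlearned = unlearn(initial, clean_data)
masks = recover(unlearned, clean_data)
flagged = low_mask_filters(masks)
print("masks:", masks, "flagged filters:", flagged)
```

The two stages differ only in the sign of the step and in what they are allowed to change: `unlearn` moves each weight, while `recover` learns a single scalar per filter. The threshold and step counts are arbitrary choices for this toy.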