Deep neural networks (DNNs) are widely known to be vulnerable to backdoor attacks, in which an attacker maliciously manipulates model behavior by tampering with a small set of training samples. Although a line of defense methods has been proposed to mitigate this threat, they either require complicated modifications to the training process or rely heavily on a specific model architecture, which makes them hard to deploy in real-world applications. In this paper, we therefore start from fine-tuning, one of the most common and easy-to-deploy backdoor defenses, and evaluate it comprehensively against diverse attack scenarios. Our initial experiments show that, in contrast to the promising defensive results at high poisoning rates, vanilla tuning methods completely fail at low poisoning rates. Our analysis shows that at low poisoning rates, the entanglement between backdoor and clean features undermines the effect of tuning-based defenses. It is therefore necessary to disentangle the backdoor and clean features in order to improve backdoor purification. To address this, we introduce Feature Shift Tuning (FST), a method for tuning-based backdoor purification. Specifically, FST encourages feature shifts by actively deviating the classifier weights from the originally compromised weights. Extensive experiments demonstrate that FST provides consistently stable performance under different attack settings. It is also convenient to deploy in real-world scenarios, with significantly reduced computation costs. Our code is available at https://github.com/AISafety-HKUST/stable_backdoor_purification.
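To make the idea of "deviating the classifier weights from the originally compromised weights" concrete, the following is a minimal sketch, not the implementation from the linked repository. It assumes that the deviation is encouraged by adding a penalty proportional to the inner product between the current classifier weights and the original ones, and that the weights are rescaled to the original norm after each step. Both choices, along with the names `shift_penalty`, `rescale_to_norm`, `penalized_step`, `alpha`, and `lr`, are illustrative assumptions that the abstract does not specify. Classifier weights are treated as a flat list of floats for simplicity.

```python
import math


def dot(u, v):
    """Inner product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def shift_penalty(w, w_ori, alpha=0.1):
    """Assumed penalty alpha * <w, w_ori>.

    It is smaller when the current weights w point away from the
    original (possibly compromised) weights w_ori.
    """
    return alpha * dot(w, w_ori)


def rescale_to_norm(w, target_norm):
    """Rescale w to a fixed L2 norm (assumed constraint, so the penalty
    cannot be reduced merely by shrinking the weights)."""
    norm = math.sqrt(dot(w, w))
    if norm == 0.0:
        return list(w)
    return [x * target_norm / norm for x in w]


def penalized_step(w, w_ori, task_grad, alpha=0.1, lr=0.1):
    """One gradient step on (task loss + shift penalty).

    task_grad is the gradient of the clean-data task loss with respect
    to w, supplied by the caller. The gradient of alpha * <w, w_ori>
    with respect to w is alpha * w_ori.
    """
    stepped = [
        wi - lr * (gi + alpha * oi)
        for wi, gi, oi in zip(w, task_grad, w_ori)
    ]
    return rescale_to_norm(stepped, math.sqrt(dot(w_ori, w_ori)))
```

In this sketch, a caller would compute `task_grad` from a small clean tuning set and apply `penalized_step` repeatedly; a larger `alpha` pushes the classifier further from its original direction at the possible cost of clean accuracy.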