Deep neural networks (DNNs) are widely known to be vulnerable to backdoor attacks, in which an attacker maliciously manipulates model behavior by tampering with a small set of training samples. Although a line of defense methods has been proposed to mitigate this threat, they either require complicated modifications to the training process or rely heavily on a specific model architecture, which makes them hard to deploy in real-world applications. In this paper, we therefore start with fine-tuning, one of the most common and easy-to-deploy backdoor defenses, and evaluate it comprehensively against diverse attack scenarios. Our initial experiments show that, in contrast to their promising defensive results at high poisoning rates, vanilla tuning methods completely fail at low poisoning rates. Our analysis shows that at low poisoning rates, the entanglement between backdoor and clean features undermines the effect of tuning-based defenses. It is therefore necessary to disentangle the backdoor and clean features to improve backdoor purification. To address this, we introduce Feature Shift Tuning (FST), a method for tuning-based backdoor purification. Specifically, FST encourages feature shifts by actively deviating the classifier weights from the originally compromised weights. Extensive experiments demonstrate that FST provides consistently stable performance under different attack settings. Without complex parameter adjustments, FST also achieves much lower tuning costs, requiring only 10 epochs. Our code is available at https://github.com/AISafety-HKUST/stable_backdoor_purification.
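To make the idea of deviating the classifier weights from the compromised weights more concrete, the following is a minimal sketch, not the released implementation. It assumes that the deviation is encouraged by adding a penalty `alpha * <w, w_ori>` to the cross-entropy loss of a linear classifier head, and that the head is rescaled after every step to keep the norm of the original weights. The abstract specifies neither choice. The function names, the values of `alpha` and `lr`, and the toy features are hypothetical, and tuning of the feature extractor is omitted.

```python
import math

def flat_dot(a, b):
    """Inner product of two weight matrices given as lists of rows."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def frob_norm(w):
    """Frobenius norm of a weight matrix."""
    return math.sqrt(flat_dot(w, w))

def rescale(w, target_norm):
    """Rescale w so that its Frobenius norm equals target_norm."""
    scale = target_norm / frob_norm(w)
    return [[x * scale for x in row] for row in w]

def ce_grad(w, feats, label):
    """Gradient of softmax cross-entropy w.r.t. a linear head w (classes x dims)."""
    logits = [sum(wi * fi for wi, fi in zip(row, feats)) for row in w]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [[(p - (1.0 if k == label else 0.0)) * f for f in feats]
            for k, p in enumerate(probs)]

def fst_step(w, w_ori, feats, label, alpha=0.1, lr=0.5):
    """One gradient step on CE(w) + alpha * <w, w_ori>, then a norm rescale.

    Assumed objective (hypothetical): the penalty term pushes w away from
    the compromised weights w_ori; its gradient w.r.t. w is alpha * w_ori.
    """
    g = ce_grad(w, feats, label)
    stepped = [[x - lr * (gx + alpha * ox) for x, gx, ox in zip(rw, rg, ro)]
               for rw, rg, ro in zip(w, g, w_ori)]
    # Assumed constraint: keep the head at the norm of the original weights.
    return rescale(stepped, frob_norm(w_ori))

if __name__ == "__main__":
    w_ori = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for compromised head weights
    w = [row[:] for row in w_ori]
    for _ in range(10):                # the abstract reports 10 tuning epochs
        w = fst_step(w, w_ori, feats=[1.0, 1.0], label=0)
    print("alignment with original weights:", flat_dot(w, w_ori))
```

Because the norm is held fixed in this sketch, any movement of the head away from `w_ori` strictly lowers the inner product `<w, w_ori>`, which gives a simple way to monitor how far the classifier has shifted during tuning.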