Backdoor defense, which aims to detect or mitigate the effect of malicious triggers introduced by attackers, is becoming increasingly critical for machine learning security and integrity. Fine-tuning on benign data is a natural way to erase the backdoor effect in a backdoored model. However, recent studies show that vanilla fine-tuning defends poorly when benign data is limited. In this work, we study fine-tuning of backdoored models in depth from the neuron perspective and find that backdoor-related neurons fail to escape the local minimum during fine-tuning. Motivated by the observation that backdoor-related neurons often have larger norms, we propose FTSAM, a novel backdoor defense paradigm that aims to shrink the norms of backdoor-related neurons by incorporating sharpness-aware minimization into fine-tuning. We demonstrate the effectiveness of our method on several benchmark datasets and network architectures, where it achieves state-of-the-art defense performance. Overall, our work provides a promising avenue for improving the robustness of machine learning models against backdoor attacks.
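For readers unfamiliar with sharpness-aware minimization (SAM), the following is a minimal sketch of one generic SAM update in pure Python. It is not the FTSAM procedure itself, whose details this abstract does not give. The names `sam_step`, `quad_grad`, `lr`, and `rho`, as well as the toy quadratic loss, are assumptions made for illustration only.

```python
import math


def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One generic SAM update on a list of floats (illustrative, not FTSAM)."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        # Zero gradient: no ascent direction is defined, so leave w unchanged.
        return list(w)
    # Ascent step: move to the approximate worst-case point inside an
    # L2 ball of radius rho around the current weights.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient taken at the perturbed point
    # to the original (unperturbed) weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def quad_grad(w):
    """Gradient of a hypothetical toy loss 0.5 * sum(c_i * w_i ** 2)."""
    curvatures = [10.0, 1.0]
    return [c * wi for c, wi in zip(curvatures, w)]


# Toy usage: repeatedly apply SAM steps from an arbitrary starting point.
w = [1.0, 1.0]
for _ in range(200):
    w = sam_step(w, quad_grad)
```

In a fine-tuning setting, `grad_fn` would presumably be the gradient of the loss on the benign data, and `w` the parameters of the backdoored model; how FTSAM adapts the perturbation to target backdoor-related neurons is not specified here.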