Deep neural networks (DNNs) are vulnerable to adversarial examples, inputs crafted with carefully designed perturbations. This vulnerability could lead to disastrous results in critical applications such as self-driving cars, security surveillance, and medical diagnosis. At present, adversarial training is one of the most effective defenses against adversarial examples. However, traditional adversarial training struggles to achieve a good trade-off between clean accuracy and robustness because DNNs still learn spurious features. The underlying reason is that, when adversarial noise cannot be disentangled from the clean example, the model cannot fully learn the core features from adversarial examples. In this paper, we disentangle adversarial examples into natural and perturbed patterns by bit-plane slicing, assuming that the higher bit-planes represent natural patterns and the lower bit-planes represent perturbed patterns. We propose Feature-Focusing Adversarial Training (F$^2$AT), which differs from previous work in that it forces the model to focus on the core features from natural patterns and reduces the impact of spurious features from perturbed patterns. Experimental results demonstrate that F$^2$AT outperforms state-of-the-art methods in both clean accuracy and adversarial robustness.
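To make the bit-plane slicing step concrete, the following is a minimal sketch of splitting an 8-bit image into a higher-bit component and a lower-bit component. It is an illustration under stated assumptions, not the F$^2$AT implementation: the function name `split_bit_planes`, the parameter `num_high_bits`, and the default split at four bits are all hypothetical, since the abstract does not say which bit-planes are assigned to each pattern.

```python
import numpy as np


def split_bit_planes(image: np.ndarray, num_high_bits: int = 4):
    """Split a uint8 image into higher-bit and lower-bit components.

    Assumption (not from the paper): the top `num_high_bits` bit-planes
    stand in for the "natural pattern" and the remaining lower bit-planes
    for the "perturbed pattern".
    """
    if image.dtype != np.uint8:
        raise TypeError("expected a uint8 image")
    if not 0 <= num_high_bits <= 8:
        raise ValueError("num_high_bits must be between 0 and 8")

    num_low_bits = 8 - num_high_bits
    # Mask that keeps only the top bit-planes, e.g. 0b11110000 for 4 bits.
    high_mask = np.uint8((0xFF << num_low_bits) & 0xFF)
    # Mask that keeps only the bottom bit-planes, e.g. 0b00001111 for 4 bits.
    low_mask = np.uint8(0xFF >> num_high_bits)

    high = image & high_mask
    low = image & low_mask
    return high, low


# A pixel value of 183 (0b10110111) splits into 176 (0b10110000) and 7 (0b0111).
pixels = np.array([[183, 12], [255, 0]], dtype=np.uint8)
high, low = split_bit_planes(pixels, num_high_bits=4)
```

Because the two masks are complementary, `high | low` reconstructs the original image exactly, so the split only separates information and discards none of it. How the two components are then used during training is specific to the method and is not sketched here.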