Over the past decade, numerous theories have been proposed to explain the widespread vulnerability of deep neural networks to adversarial evasion attacks. Among these, the theory of non-robust features proposed by Ilyas et al. has been widely accepted, showing that brittle but predictive features of the data distribution can be directly exploited by attackers. However, this theory overlooks adversarial samples that do not directly exploit these features. In this work, we argue that these two kinds of samples - those that use brittle but predictive features and those that do not - constitute two distinct types of adversarial weakness and should be differentiated when evaluating adversarial robustness. To this end, we propose an ensemble-based metric that measures how strongly adversarial perturbations manipulate non-robust features, and we use this metric to analyze the makeup of adversarial samples generated by attackers. This new perspective also allows us to re-examine several phenomena, including the impact of sharpness-aware minimization on adversarial robustness and the robustness gap observed between adversarial training and standard training on robust datasets.
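The abstract does not spell out how the ensemble-based metric is computed, but as a rough illustration of what such a measurement could look like, the sketch below scores a perturbation by how much it shifts the predictions of an ensemble of standardly trained models, on the premise that such models lean heavily on non-robust features. All names here are hypothetical and this is only a minimal PyTorch sketch under that assumption, not the paper's actual metric.

```python
import torch
import torch.nn.functional as F

def nonrobust_feature_shift(ensemble, x_clean, x_adv):
    """Hypothetical proxy for how strongly a perturbation manipulates
    non-robust features: the mean KL divergence between each standard
    (non-robustly trained) model's predictions on the clean inputs and
    on their adversarial counterparts. A large value would suggest the
    attack exploits brittle-but-predictive features; a small value,
    that it succeeds through some other weakness."""
    shifts = []
    for model in ensemble:  # ensemble of standardly trained classifiers
        model.eval()
        with torch.no_grad():
            log_p_clean = F.log_softmax(model(x_clean), dim=1)
            p_adv = F.softmax(model(x_adv), dim=1)
            # KL(p_adv || p_clean), averaged over the batch
            shifts.append(F.kl_div(log_p_clean, p_adv, reduction="batchmean"))
    return torch.stack(shifts).mean()
```

Averaging over an ensemble rather than reading a single model is what keeps the score tied to features of the data distribution instead of the idiosyncrasies of one network's decision boundary.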