Adversarial examples mainly exploit changes to input pixels to which humans are not sensitive, and arise from the fact that models make decisions based on uninterpretable features. Interestingly, cognitive science reports that the interpretability process behind human classification decisions relies predominantly on low spatial frequency components. In this paper, we investigate the robustness to adversarial perturbations of models that are constrained during training to leverage information from different spatial frequency ranges. We show that this robustness is tightly linked to the spatial frequency characteristics of the data at hand. Indeed, depending on the data set, the same constraint may result in very different levels of robustness (up to a 0.41 difference in adversarial accuracy). To explain this phenomenon, we conduct several experiments that shed light on influential factors such as the level of sensitivity to high frequencies and the transferability of adversarial perturbations between original and low-pass filtered inputs.
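To make the notion of a low-pass filtered input concrete, the following is a minimal sketch of one common way to restrict an image to its low spatial frequencies: an ideal circular mask applied in the Fourier domain. The abstract does not specify which filter the paper uses, so the function name `low_pass_filter`, the `cutoff` parameter (in cycles per image), and the choice of an ideal mask are all illustrative assumptions.

```python
import numpy as np


def low_pass_filter(image, cutoff):
    """Keep only spatial frequencies whose radius is at most `cutoff`.

    Hypothetical helper for illustration: `image` is a 2-D grayscale array
    and `cutoff` is expressed in cycles per image. An ideal (hard) circular
    mask is assumed; the paper may use a different filter.
    """
    h, w = image.shape
    # Frequency indices in cycles per image along each axis.
    fy = np.fft.fftfreq(h) * h
    fx = np.fft.fftfreq(w) * w
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    mask = radius <= cutoff

    # Zero out everything outside the mask, then return to pixel space.
    spectrum = np.fft.fft2(image)
    return np.real(np.fft.ifft2(spectrum * mask))


# Example: split a random image into low- and high-frequency parts.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
smooth = low_pass_filter(image, cutoff=4)
residual = image - smooth  # high-frequency content removed by the filter
```

Under these assumptions, a perturbation crafted on `image` can be compared with one crafted on `smooth`, and `residual` gives a rough view of the high-frequency content to which a model may be sensitive.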