Adversarial attacks add small, imperceptible perturbations to input samples that cause large, undesired changes in the output of deep learning models. Despite extensive research on generating adversarial attacks and building defense systems, there has been limited research on understanding adversarial attacks from an input-data perspective. This work introduces the notion of sample attackability, where we aim to identify the samples that are most susceptible to adversarial attacks (attackable samples) and, conversely, the samples that are least susceptible (robust samples). We propose a deep-learning-based method to detect the adversarially attackable and robust samples in an unseen dataset for an unseen target model. Experiments on standard image classification datasets enable us to assess the portability of the deep attackability detector across a range of architectures. We find that the deep attackability detector performs better than simple model uncertainty-based measures at identifying the attackable and robust samples. This suggests that uncertainty is an inadequate proxy for measuring a sample's distance to a decision boundary. Beyond improving our understanding of adversarial attack theory, the ability to identify adversarially attackable and robust samples has implications for improving the efficiency of sample-selection tasks, e.g. active learning in augmentation for adversarial training.
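To make the distinction between uncertainty and distance to a decision boundary concrete, the following is a minimal sketch and not the method proposed in this work. It assumes a multiclass linear classifier, for which the smallest L2 perturbation that changes the prediction has a closed form, and it labels a sample attackable or robust by thresholding that perturbation size. The function names and the thresholds `eps_attackable` and `eps_robust` are hypothetical choices made for illustration only.

```python
import math


def logits(W, b, x):
    """Logits z_k = w_k . x + b_k of a linear classifier (one weight row per class)."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b_k for w, b_k in zip(W, b)]


def softmax_entropy(z):
    """Entropy of the softmax distribution, a simple uncertainty measure."""
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)


def min_l2_perturbation(W, b, x):
    """Smallest L2 perturbation that flips the predicted class.

    For a linear model the distance from x to the boundary between the
    predicted class c and another class j is (z_c - z_j) / ||w_c - w_j||;
    the minimum over j is the distance to the nearest decision boundary.
    """
    z = logits(W, b, x)
    c = max(range(len(z)), key=z.__getitem__)
    best = math.inf
    for j in range(len(z)):
        if j == c:
            continue
        diff = [wc - wj for wc, wj in zip(W[c], W[j])]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm > 0:
            best = min(best, (z[c] - z[j]) / norm)
    return best


def label_sample(W, b, x, eps_attackable, eps_robust):
    """Hypothetical labelling rule: threshold the minimum perturbation size."""
    d = min_l2_perturbation(W, b, x)
    if d < eps_attackable:
        return "attackable"
    if d > eps_robust:
        return "robust"
    return "neither"
```

In this toy setting, two cases can produce identical logits, and therefore identical entropy, while sitting at very different distances from the boundary: with `W = [[1, 0], [-1, 0]]` and `x = [0.5, 0]` the distance is 0.5, whereas scaling the weights by 10 and the input by 0.1 leaves the logits unchanged but shrinks the distance to 0.05. This is one simple way in which output uncertainty can fail to track sample attackability.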