Adversarial attacks dramatically change the output of an otherwise accurate learning system by making a seemingly inconsequential modification to a piece of input data. Paradoxically, empirical evidence indicates that even systems which are robust to large random perturbations of the input data remain susceptible to small, easily constructed adversarial perturbations of their inputs. Here, we show that this may be seen as a fundamental feature of classifiers working with high-dimensional input data. We introduce a simple, generic, and generalisable framework in which the key behaviours observed in practical systems arise with high probability: notably, the simultaneous susceptibility of the (otherwise accurate) model to easily constructed adversarial attacks and its robustness to random perturbations of the input data. We confirm that the same phenomena are directly observed in practical neural networks trained on standard image classification problems, where even large additive random noise fails to trigger the adversarial instability of the network. A surprising takeaway is that even small margins separating a classifier's decision surface from the training and test data can hide adversarial susceptibility from detection by randomly sampled perturbations. Counterintuitively, additive noise applied during training or testing is therefore an inefficient tool for eradicating or detecting adversarial examples, and more demanding adversarial training is required.
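To make the contrast concrete, the following minimal numpy sketch (our own illustration, not the experiments reported in the paper) uses a toy linear classifier in high dimension: a correctly classified point sits at a small margin from the decision surface, random perturbations of much larger norm almost never flip its label, while the worst-case perturbation of the same norm (the analogue of a gradient-based attack) always does. The dimension d, margin m, and budget eps below are arbitrary illustrative choices.

```python
import numpy as np

# Toy setting (illustrative assumptions): a linear classifier f(x) = w.x in
# dimension d, with a test point at Euclidean distance m from the decision
# surface.  We compare (i) random perturbations of norm eps with (ii) the
# worst-case perturbation of the same norm, -eps * w.

rng = np.random.default_rng(0)

d = 10_000                    # high-dimensional input
w = rng.standard_normal(d)
w /= np.linalg.norm(w)        # unit normal of the decision surface
m = 0.05                      # small but positive margin of the test point
eps = 1.0                     # perturbation budget, much larger than m

x = m * w                     # a correctly classified point at distance m
assert w @ x > 0

# (i) Random perturbations: only the component along w matters, and for an
# isotropic direction of norm eps that component is roughly eps/sqrt(d) in
# magnitude, so it almost never exceeds the margin m.
n_trials = 5_000
flips = 0
for _ in range(n_trials):
    delta = rng.standard_normal(d)
    delta *= eps / np.linalg.norm(delta)      # uniform direction, norm eps
    if w @ (x + delta) <= 0:                  # did the label flip?
        flips += 1
print(f"random perturbations flipping the label: {flips}/{n_trials}")

# (ii) Adversarial perturbation of the same norm: move straight towards the
# decision surface along -w; it flips the label as soon as eps exceeds m.
x_adv = x - eps * w
print("adversarial perturbation flips the label:", bool(w @ x_adv <= 0))
```

Under these assumptions the random probes report essentially zero label flips, even though an adversarial perturbation twenty times smaller than eps would already cross the margin, which is the mechanism the abstract refers to when it says small margins can hide adversarial susceptibility from randomly sampled perturbations.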