Adversarial attacks dramatically change the output of an otherwise accurate learning system using a seemingly inconsequential modification to a piece of input data. Paradoxically, empirical evidence indicates that even systems which are robust to large random perturbations of the input data remain susceptible to small, easily constructed adversarial perturbations of their inputs. Here, we show that this may be seen as a fundamental feature of classifiers working with high-dimensional input data. We introduce a simple, generic and generalisable framework in which key behaviours observed in practical systems arise with high probability, notably the simultaneous susceptibility of the (otherwise accurate) model to easily constructed adversarial attacks and its robustness to random perturbations of the input data. We confirm that the same phenomena are directly observed in practical neural networks trained on standard image classification problems, where even large additive random noise fails to trigger the adversarial instability of the network. A surprising takeaway is that even small margins separating a classifier's decision surface from training and testing data can prevent adversarial susceptibility from being detected using randomly sampled perturbations. Counterintuitively, using additive noise during training or testing is therefore inefficient for eradicating or detecting adversarial examples, and more demanding adversarial training is required.
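The contrast between random and adversarial perturbations can be illustrated with a toy computation. The following is a minimal sketch under stated assumptions, not the framework introduced in the paper: it assumes a hypothetical linear classifier sign(w . x) in a high-dimensional space and a single correctly classified point at a small distance (`margin`) from the decision surface. The function name `flip_rates` and all parameter values are chosen here purely for illustration.

```python
import numpy as np


def flip_rates(dim=2000, margin=0.1, noise_norm=1.0, trials=1000, seed=0):
    """Compare random and adversarial perturbations for a toy linear classifier.

    Assumption: the classifier is sign(w . x) with a unit normal vector w.
    This is an illustrative stand-in, not the model analysed in the paper.
    """
    rng = np.random.default_rng(seed)

    # Unit normal of the (hypothetical) decision hyperplane.
    w = np.zeros(dim)
    w[0] = 1.0

    # A correctly classified point sitting `margin` away from the hyperplane.
    x = margin * w

    # Random perturbations drawn uniformly from the sphere of radius noise_norm.
    noise = rng.standard_normal((trials, dim))
    noise *= noise_norm / np.linalg.norm(noise, axis=1, keepdims=True)
    random_flip_rate = float(np.mean((x + noise) @ w <= 0))

    # Adversarial perturbation: step straight towards the hyperplane,
    # only slightly further than the margin.
    adversarial = -1.1 * margin * w
    adversarial_flips = bool((x + adversarial) @ w <= 0)

    return random_flip_rate, adversarial_flips, float(np.linalg.norm(adversarial))


if __name__ == "__main__":
    rate, flips, adv_norm = flip_rates()
    print(f"fraction of random perturbations (norm 1.0) that flip the label: {rate}")
    print(f"adversarial perturbation (norm {adv_norm:.2f}) flips the label: {flips}")
```

In this sketch the component of a random perturbation along w has a typical size of about noise_norm / sqrt(dim), so in high dimension it rarely exceeds even a small margin, while a perturbation aligned with w needs a norm only slightly larger than the margin to change the label.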