Why machine learning models admit adversarial attacks that are imperceptible to a human remains largely unexplained from a theoretical perspective. In this work, we introduce two notions of adversarial attacks: natural or on-manifold attacks, which are perceptible by a human or oracle, and unnatural or off-manifold attacks, which are not. We argue that the existence of off-manifold attacks is a natural consequence of the dimension gap between the intrinsic and ambient dimensions of the data. For 2-layer ReLU networks, we prove that although the dimension gap does not affect generalization performance on samples drawn from the observed data space, it makes the clean-trained model more vulnerable to adversarial perturbations in the off-manifold directions of the data space. Our main results provide an explicit relationship between the $\ell_2$ and $\ell_{\infty}$ attack strengths of on- and off-manifold attacks and the dimension gap.
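To make the on-manifold and off-manifold distinction concrete, the sketch below splits a perturbation into a component inside the data subspace and a component orthogonal to it. It is only an illustration under a strong simplifying assumption that the abstract does not state: the data manifold is taken to be a linear subspace of intrinsic dimension `d` inside an ambient space of dimension `D`. The helper `split_perturbation` and the values of `D` and `d` are hypothetical and are not taken from the paper.

```python
import numpy as np


def split_perturbation(delta, basis):
    """Split a perturbation into on-manifold and off-manifold parts.

    Assumption (not from the paper): the data manifold is the linear
    subspace spanned by the orthonormal columns of `basis`, shape (D, d).
    """
    on = basis @ (basis.T @ delta)  # orthogonal projection onto the subspace
    off = delta - on                # remainder, orthogonal to the subspace
    return on, off


rng = np.random.default_rng(0)
D, d = 50, 5  # hypothetical ambient and intrinsic dimensions; gap is D - d

# Orthonormal basis of a random d-dimensional subspace (reduced QR).
basis, _ = np.linalg.qr(rng.standard_normal((D, d)))

# An arbitrary perturbation in the ambient space.
delta = rng.standard_normal(D)
on, off = split_perturbation(delta, basis)

# l2 and l_inf sizes of each component.
on_l2, off_l2 = np.linalg.norm(on), np.linalg.norm(off)
on_linf, off_linf = np.abs(on).max(), np.abs(off).max()

print(f"on-manifold:  l2={on_l2:.3f}, linf={on_linf:.3f}")
print(f"off-manifold: l2={off_l2:.3f}, linf={off_linf:.3f}")
```

In this toy setting the two components are orthogonal, so their squared $\ell_2$ norms add up to the squared norm of the full perturbation. The off-manifold component lives in a space of dimension `D - d`, which is the dimension gap in this linear picture.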