Despite a great deal of research, it is still not well understood why trained neural networks are highly vulnerable to adversarial examples. In this work we focus on two-layer neural networks trained on data that lie on a low-dimensional linear subspace. We show that standard gradient methods lead to non-robust neural networks, namely, networks that have large gradients in directions orthogonal to the data subspace and are susceptible to small adversarial $L_2$-perturbations in these directions. Moreover, we show that decreasing the initialization scale of the training algorithm, or adding $L_2$ regularization, can make the trained network more robust to adversarial perturbations orthogonal to the data.
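To make the mechanism concrete, here is a minimal NumPy sketch. It uses assumptions of our own, not the paper's exact setting: a bias-free two-layer ReLU network $f(x)=\sum_j v_j\,\mathrm{relu}(w_j^\top x)$, logistic loss, full-batch gradient descent, and synthetic data on a random $k$-dimensional subspace with a hypothetical labeling rule. In this model every gradient step on a first-layer weight $w_j$ is a linear combination of training points. As a result, the component of $w_j$ orthogonal to the data subspace never changes, or shrinks by a factor $1-\eta\lambda$ per step under $L_2$ weight decay $\lambda$ with step size $\eta$. The orthogonal part of the input gradient is therefore built from the initial weights, which illustrates why the initialization scale and $L_2$ regularization can control it. The helper names (`make_subspace_data`, `train_two_layer`, `input_gradient`) and all hyperparameters are illustrative.

```python
import numpy as np


def make_subspace_data(n, d, k, rng):
    """Hypothetical synthetic data lying exactly on a random k-dim subspace of R^d.

    Labels (an illustrative choice) are the sign of the first latent coordinate.
    """
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))  # orthonormal basis, d x k
    Z = rng.standard_normal((n, k))
    X = Z @ U.T                                       # every row lies in span(U)
    y = np.where(Z[:, 0] >= 0, 1.0, -1.0)
    return X, y, U


def train_two_layer(X, y, width, init_scale, weight_decay, lr, steps, rng):
    """Full-batch GD on f(x) = sum_j v_j * relu(w_j . x) with mean logistic loss."""
    n, d = X.shape
    W = init_scale * rng.standard_normal((width, d)) / np.sqrt(d)
    v = init_scale * rng.standard_normal(width) / np.sqrt(width)
    W0 = W.copy()
    for _ in range(steps):
        pre = X @ W.T                    # (n, width) pre-activations
        act = np.maximum(pre, 0.0)
        out = act @ v                    # (n,) network outputs
        # d/d(out) of mean log(1 + exp(-y * out)), computed stably.
        g = -y * np.exp(-np.logaddexp(0.0, y * out)) / n
        grad_v = act.T @ g
        # Each row of grad_W is a combination of the data points x_i,
        # so it has no component orthogonal to the data subspace.
        grad_W = ((g[:, None] * (pre > 0)) * v[None, :]).T @ X
        v -= lr * (grad_v + weight_decay * v)
        W -= lr * (grad_W + weight_decay * W)
    return W, v, W0


def input_gradient(W, v, x):
    """Gradient of f at x: sum_j v_j * 1[w_j . x > 0] * w_j."""
    return W.T @ (v * (W @ x > 0))


rng = np.random.default_rng(0)
d, k, n, width = 50, 2, 200, 100
X, y, U = make_subspace_data(n, d, k, rng)
P_perp = np.eye(d) - U @ U.T             # projector onto the orthogonal complement

lr, steps = 0.1, 500
results = {}
for init_scale, wd in [(1.0, 0.0), (0.01, 0.0), (1.0, 1e-2)]:
    W, v, W0 = train_two_layer(X, y, width, init_scale, wd, lr, steps,
                               np.random.default_rng(1))
    grads = np.array([input_gradient(W, v, x) for x in X])
    orth = np.linalg.norm(grads @ P_perp, axis=1).mean()
    par = np.linalg.norm(grads @ (np.eye(d) - P_perp), axis=1).mean()
    results[(init_scale, wd)] = (W, W0, orth, par)
    print(f"init={init_scale:<5} wd={wd:<6} "
          f"mean |grad_perp|={orth:.4f} mean |grad_par|={par:.4f}")
```

The script prints, for each configuration, the mean norm of the input gradient projected orthogonally to and onto the data subspace; the exact numbers depend on the seed and hyperparameters. The point it checks is structural: `W @ P_perp` stays equal to its initial value without weight decay and decays geometrically with it.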