Despite a great deal of research, it is still not well understood why trained neural networks are highly vulnerable to adversarial examples. In this work, we focus on two-layer neural networks trained on data that lie on a low-dimensional linear subspace. We show that standard gradient methods lead to non-robust neural networks, namely, networks that have large gradients in directions orthogonal to the data subspace and are susceptible to small adversarial $L_2$-perturbations in these directions. Moreover, we show that decreasing the initialization scale of the training algorithm, or adding $L_2$ regularization, can make the trained network more robust to adversarial perturbations orthogonal to the data.
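A minimal NumPy sketch can illustrate one simple mechanism consistent with these claims. It is not the paper's construction or proof, and it rests on several assumptions: the data lie in the span of the first `k` coordinates of $\mathbb{R}^d$, only the first layer is trained (full-batch gradient descent on the logistic loss), the second layer is fixed at random signs, and the helper names `make_subspace_data`, `train_two_layer`, and `input_gradient` are hypothetical. Because every first-layer gradient is a linear combination of training points, the weight components orthogonal to the data subspace never move from their initial values without regularization, and shrink by a factor $(1 - \eta\lambda)$ per step with $L_2$ regularization (weight decay $\lambda$, step size $\eta$). The orthogonal part of the input gradient is built only from those components, so its size is inherited from the initialization.

```python
import numpy as np

def make_subspace_data(n=200, d=50, k=5, seed=0):
    """Hypothetical toy data: points supported on the first k coordinates of R^d,
    labeled by a random linear teacher inside that subspace."""
    rng = np.random.default_rng(seed)
    X = np.zeros((n, d))
    X[:, :k] = rng.standard_normal((n, k))
    teacher = rng.standard_normal(k)
    y = np.where(X[:, :k] @ teacher >= 0, 1.0, -1.0)
    return X, y

def train_two_layer(X, y, width=100, init_scale=1.0, weight_decay=0.0,
                    lr=0.5, steps=300, seed=1):
    """Full-batch gradient descent on f(x) = sum_j v_j * relu(w_j . x) with the
    mean logistic loss. Only the first layer W is trained; the second layer v is
    fixed at random signs (a simplifying assumption)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W0 = init_scale * rng.standard_normal((width, d)) / np.sqrt(d)
    v = rng.choice([-1.0, 1.0], size=width) / np.sqrt(width)
    W = W0.copy()
    for _ in range(steps):
        pre = X @ W.T                                   # (n, width) pre-activations
        out = np.maximum(pre, 0.0) @ v                  # (n,) network outputs
        # d/d(out) of log(1 + exp(-y * out)), averaged over the n samples
        g_out = -y * np.exp(-np.logaddexp(0.0, y * out)) / n
        g_pre = g_out[:, None] * v[None, :] * (pre > 0)  # (n, width)
        # Each row of grad_W is a combination of data points x_i, so it has
        # no component outside the data subspace.
        grad_W = g_pre.T @ X
        W = W - lr * (grad_W + weight_decay * W)
    return W0, W, v

def input_gradient(W, v, x):
    """Gradient of f with respect to the input x (ReLU derivative taken as 1[z > 0])."""
    return W.T @ (v * (W @ x > 0))

k = 5
X, y = make_subspace_data(k=k)
x_test = X[0]  # a point on the data subspace

results = {}
for init_scale, wd in [(1.0, 0.0), (0.01, 0.0), (1.0, 0.01)]:
    W0, W, v = train_two_layer(X, y, init_scale=init_scale, weight_decay=wd)
    g = input_gradient(W, v, x_test)
    grad_on, grad_off = np.linalg.norm(g[:k]), np.linalg.norm(g[k:])
    results[(init_scale, wd)] = {"W0": W0, "W": W, "grad_off": grad_off}
    print(f"init_scale={init_scale}, weight_decay={wd}: "
          f"|grad on subspace|={grad_on:.3f}, |grad orthogonal|={grad_off:.3f}")
```

In this toy model, a smaller `init_scale` or a positive `weight_decay` yields a smaller orthogonal input-gradient component at `x_test`, matching the direction of the effect described above; it does not reproduce the paper's theorems or quantitative bounds.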