Adversarial training, which aims to enhance robustness against adversarial attacks, has received much attention because human-imperceptible perturbations of data that deceive a given deep neural network are easy to generate. In this paper, we propose a new adversarial training algorithm that is theoretically well motivated and empirically superior to existing algorithms. A novel feature of the proposed algorithm is that it applies more regularization to data vulnerable to adversarial attacks than existing regularization algorithms do. Theoretically, we show that our algorithm can be understood as minimizing a regularized empirical risk motivated by a newly derived upper bound of the robust risk. Numerical experiments illustrate that the proposed algorithm simultaneously improves generalization (accuracy on clean examples) and robustness (accuracy under adversarial attacks), achieving state-of-the-art performance.
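The idea of regularizing vulnerable data more strongly can be illustrated with a small sketch. The abstract gives no formula, so everything below is an assumption made for illustration and not the paper's actual objective: the function name `weighted_regularized_loss`, the KL-divergence regularizer, the per-sample weight `1 - p_adv[y]`, and the trade-off parameter `lam` are all hypothetical. The adversarial logits are taken as given inputs; generating the perturbations is outside the scope of the sketch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def safe_log(p, floor=1e-12):
    """Logarithm with a small floor to avoid log(0)."""
    return math.log(max(p, floor))


def weighted_regularized_loss(clean_logits, adv_logits, labels, lam=1.0):
    """Mean of (cross-entropy on clean input + weighted robustness penalty).

    Hypothetical form: the penalty KL(p_clean || p_adv) is scaled by
    1 - p_adv[y], so samples whose adversarial prediction assigns low
    probability to the true label (vulnerable samples) are penalized more.
    """
    total = 0.0
    for z_clean, z_adv, y in zip(clean_logits, adv_logits, labels):
        p_clean = softmax(z_clean)
        p_adv = softmax(z_adv)
        # Standard cross-entropy on the clean prediction.
        ce = -safe_log(p_clean[y])
        # KL divergence between clean and adversarial predictions.
        kl = sum(p * (safe_log(p) - safe_log(q)) for p, q in zip(p_clean, p_adv))
        # Assumed vulnerability weight: larger when the attack succeeds.
        weight = 1.0 - p_adv[y]
        total += ce + lam * weight * kl
    return total / len(labels)


clean = [[2.0, 0.0], [0.0, 2.0]]
flipped = [[0.0, 2.0], [2.0, 0.0]]  # stand-in for logits under a successful attack
labels = [0, 1]

loss_no_attack = weighted_regularized_loss(clean, clean, labels)
loss_attacked = weighted_regularized_loss(clean, flipped, labels)
print(loss_no_attack < loss_attacked)
```

When the adversarial predictions match the clean ones, the penalty vanishes and the loss reduces to plain cross-entropy; when the attack flips the prediction, both the KL term and its weight grow, which is the qualitative behavior the abstract describes.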