Adversarial training, which aims to improve robustness against adversarial attacks, has received much attention because human-imperceptible perturbations of data that deceive a given deep neural network are easy to generate. In this paper, we propose a new adversarial training algorithm that is theoretically well motivated and empirically superior to existing algorithms. A novel feature of the proposed algorithm is that it applies more regularization to data vulnerable to adversarial attacks than existing regularization algorithms do. Theoretically, we show that our algorithm can be understood as minimizing a regularized empirical risk motivated by a newly derived upper bound of the robust risk. Numerical experiments illustrate that the proposed algorithm improves generalization (accuracy on clean examples) and robustness (accuracy under adversarial attacks) simultaneously, achieving state-of-the-art performance.
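To make the idea of putting more regularization on vulnerable data concrete, the following is a minimal sketch of one possible regularized empirical risk. It is not the formula of the proposed algorithm, which the abstract does not state. The function names, the KL-divergence regularizer, the per-sample weight `1 - p_adv[y]`, and the parameter `lam` are all assumptions chosen for illustration, and the adversarial logits are taken as given rather than produced by an attack.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def weighted_regularized_risk(clean_logits, adv_logits, labels, lam=1.0):
    """Mean over samples of: cross-entropy on the clean input plus a
    weighted regularizer comparing clean and adversarial predictions.

    Assumption (hypothetical, for illustration only): the weight is
    1 - p_adv[y], so samples whose adversarial prediction assigns low
    probability to the true label are regularized more strongly.
    """
    total = 0.0
    for z_clean, z_adv, y in zip(clean_logits, adv_logits, labels):
        p_clean = softmax(z_clean)
        p_adv = softmax(z_adv)
        cross_entropy = -math.log(p_clean[y] + 1e-12)
        weight = 1.0 - p_adv[y]  # larger for more vulnerable samples
        regularizer = kl_divergence(p_clean, p_adv)
        total += cross_entropy + lam * weight * regularizer
    return total / len(labels)


# A sample whose prediction is unchanged by the perturbation...
robust = weighted_regularized_risk([[2.0, 0.0]], [[2.0, 0.0]], [0])
# ...versus one whose prediction flips under the perturbation.
vulnerable = weighted_regularized_risk([[2.0, 0.0]], [[0.0, 2.0]], [0])
```

Under these assumptions, the sample whose prediction flips contributes a larger penalty than the one whose prediction is stable, while both share the same clean cross-entropy term.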