Adversarial training, which aims to enhance robustness against adversarial attacks, has received much attention because it is easy to generate human-imperceptible perturbations of data that deceive a given deep neural network. In this paper, we propose a new adversarial training algorithm that is theoretically well motivated and empirically superior to existing algorithms. A novel feature of the proposed algorithm is that it applies more regularization to data vulnerable to adversarial attacks than existing regularization algorithms do. Theoretically, we show that our algorithm can be understood as minimizing a regularized empirical risk motivated by a newly derived upper bound of the robust risk. Numerical experiments illustrate that our proposed algorithm improves generalization (accuracy on clean examples) and robustness (accuracy under adversarial attacks) simultaneously, achieving state-of-the-art performance.
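To make the idea of applying more regularization to vulnerable data concrete, the following is a minimal sketch and not the algorithm proposed in the paper, whose exact objective is not stated in this abstract. It assumes a hypothetical regularized empirical risk: cross-entropy on clean inputs plus a KL penalty between clean and adversarial predictions, weighted per example by one minus the probability the model assigns to the true label on the adversarial input. The function names, the weight, and the parameter `lam` are all illustrative assumptions, and the adversarial logits are taken as given (no attack is implemented here).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def weighted_regularized_risk(clean_logits, adv_logits, labels, lam=1.0):
    """Hypothetical objective (an assumption, not the paper's formula):
    mean over examples of
        cross_entropy(clean) + lam * weight * KL(clean || adversarial),
    where weight = 1 - p_adv[true label] grows for vulnerable examples.
    """
    total = 0.0
    for z_clean, z_adv, y in zip(clean_logits, adv_logits, labels):
        p_clean = softmax(z_clean)
        p_adv = softmax(z_adv)
        cross_entropy = -math.log(p_clean[y] + 1e-12)
        # Assumed vulnerability weight: near 0 when the adversarial input is
        # still classified correctly, near 1 when the attack succeeds.
        weight = 1.0 - p_adv[y]
        total += cross_entropy + lam * weight * kl_divergence(p_clean, p_adv)
    return total / len(labels)


if __name__ == "__main__":
    clean = [[2.0, 0.0]]
    labels = [0]
    robust_adv = [[2.0, 0.0]]      # prediction unchanged under attack
    vulnerable_adv = [[0.0, 2.0]]  # prediction flipped under attack
    print(weighted_regularized_risk(clean, robust_adv, labels))
    print(weighted_regularized_risk(clean, vulnerable_adv, labels))
```

In this sketch an example whose prediction survives the perturbation contributes only its clean cross-entropy, while an example whose prediction flips is penalized more heavily, which is the qualitative behavior the abstract describes.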