One prominent approach to addressing the adversarial vulnerability of deep neural networks is the two-player zero-sum paradigm of adversarial training, in which predictors are trained against adversarially chosen perturbations of data. Despite the promise of this approach, algorithms based on this paradigm have not yielded sufficient levels of robustness, and they suffer from pathological behavior such as robust overfitting. To understand this shortcoming, we first show that the surrogate-based relaxation commonly used in adversarial training algorithms voids all guarantees on the robustness of trained classifiers. Identifying this pitfall motivates a novel non-zero-sum bilevel formulation of adversarial training, in which each player optimizes a different objective function. Our formulation naturally yields a simple algorithmic framework that matches and in some cases outperforms state-of-the-art attacks, attains levels of robustness comparable to standard adversarial training algorithms, and does not suffer from robust overfitting.
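To make the phrase "each player optimizes a different objective function" concrete, the following is a minimal sketch on a toy linear classifier. The abstract does not state the two objectives, so the choices here are assumptions made for illustration: the attacker maximizes the negative margin (largest wrong-class logit minus true-class logit) over an l-infinity ball, while the defender takes gradient steps on the cross-entropy surrogate at the attacked input. All function names (`negative_margin`, `margin_attack`, `cross_entropy_step`, `train`, `robust_accuracy`) and the toy data are hypothetical, and the closed-form attack is valid only because the model is linear.

```python
import numpy as np

def negative_margin(logits, y):
    """Largest wrong-class logit minus the true-class logit (> 0 means misclassified)."""
    wrong = np.delete(logits, y)
    return wrong.max() - logits[y]

def margin_attack(W, b, x, y, eps):
    """Attacker (assumed objective): maximize the negative margin with ||eta||_inf <= eps.

    For a linear model the best perturbation toward wrong class j has the
    closed form eta_j = eps * sign(W[j] - W[y]); we keep the best j.
    """
    best_eta, best_val = np.zeros_like(x), -np.inf
    for j in range(W.shape[0]):
        if j == y:
            continue
        eta = eps * np.sign(W[j] - W[y])
        logits = W @ (x + eta) + b
        val = logits[j] - logits[y]
        if val > best_val:
            best_eta, best_val = eta, val
    return best_eta

def cross_entropy_step(W, b, x, y, lr):
    """Defender (assumed objective): one gradient step on cross-entropy at input x."""
    logits = W @ x + b
    p = np.exp(logits - logits.max())
    p /= p.sum()
    p[y] -= 1.0  # gradient of the cross-entropy with respect to the logits
    return W - lr * np.outer(p, x), b - lr * p

def train(X, Y, num_classes, eps=0.5, lr=0.1, epochs=20):
    """Alternate the two players; note they do not share one objective."""
    W = np.zeros((num_classes, X.shape[1]))
    b = np.zeros(num_classes)
    for _ in range(epochs):
        for x, y in zip(X, Y):
            eta = margin_attack(W, b, x, y, eps)             # attacker: margin
            W, b = cross_entropy_step(W, b, x + eta, y, lr)  # defender: surrogate
    return W, b

def robust_accuracy(W, b, X, Y, eps):
    """Fraction of points still classified correctly under the margin attack."""
    hits = 0
    for x, y in zip(X, Y):
        eta = margin_attack(W, b, x, y, eps)
        hits += int(negative_margin(W @ (x + eta) + b, y) < 0)
    return hits / len(X)

# Toy, well-separated two-class data (illustrative only).
X = np.array([[2.0, 2.0], [3.0, 2.0], [-2.0, -2.0], [-3.0, -2.0]])
Y = np.array([0, 0, 1, 1])
W, b = train(X, Y, num_classes=2)
print("robust accuracy:", robust_accuracy(W, b, X, Y, eps=0.5))
```

In the zero-sum version of this sketch, `margin_attack` would instead maximize the same cross-entropy that the defender minimizes. Swapping in the margin for the attacker is what makes the game non-zero-sum here; for a deep network the closed-form step would have to be replaced by an iterative optimizer.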