One prominent approach to addressing the adversarial vulnerability of deep neural networks is the two-player zero-sum paradigm of adversarial training, in which predictors are trained against adversarially chosen perturbations of data. Despite the promise of this approach, algorithms based on this paradigm have not produced sufficient levels of robustness and exhibit pathological behavior such as robust overfitting. To understand this shortcoming, we first show that the surrogate-based relaxation commonly used in adversarial training algorithms voids all guarantees on the robustness of trained classifiers. The identification of this pitfall informs a novel non-zero-sum bilevel formulation of adversarial training, wherein each player optimizes a different objective function. Our formulation yields a simple algorithmic framework that matches and in some cases outperforms state-of-the-art attacks, attains levels of robustness comparable to standard adversarial training algorithms, and does not suffer from robust overfitting.
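To make the contrast between the two paradigms concrete, the formulations can be sketched as follows. The notation is an assumption introduced here for illustration and does not appear in the abstract: $f_\theta$ is a classifier that outputs one score per class, $\ell$ is a surrogate loss such as cross-entropy, $\epsilon$ is the perturbation budget, and $\mathcal{D}$ is the data distribution. The first problem is the standard zero-sum form, in which both players act on the same surrogate loss. The second is one plausible reading of a non-zero-sum bilevel form, in which the defender still minimizes the surrogate while the attacker maximizes a different objective, taken here to be the classification margin.

```latex
% Zero-sum adversarial training: both players share the surrogate loss \ell.
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}
  \Big[ \max_{\|\eta\| \le \epsilon} \ell\big(f_\theta(x+\eta),\, y\big) \Big]

% Non-zero-sum bilevel sketch (assumed form): the defender minimizes \ell,
% while the attacker maximizes the margin of the best incorrect class.
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}
  \Big[ \ell\big(f_\theta(x+\eta^\star),\, y\big) \Big]
\quad \text{subject to} \quad
\eta^\star \in \operatorname*{arg\,max}_{\|\eta\| \le \epsilon} \;
  \max_{j \ne y} \Big( f_\theta(x+\eta)_j - f_\theta(x+\eta)_y \Big)
```

Under this reading, the attacker's objective is positive exactly when the perturbed input is misclassified, so the lower-level problem targets the classification error itself and not a relaxation of it.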