Adversarial training, especially with projected gradient descent (PGD), has proven to be a successful approach for improving robustness against adversarial attacks. After adversarial training, the gradients of a model with respect to its inputs have a preferential direction. However, this direction of alignment has not been established mathematically, which makes it difficult to evaluate quantitatively. We propose a novel definition of this direction: the direction of the vector pointing toward the closest point of the support of the closest inaccurate class in decision space. To evaluate the alignment with this direction after adversarial training, we apply a metric that uses generative adversarial networks to produce the smallest residual needed to change the class present in the image. We show that PGD-trained models have a higher alignment than the baseline according to our definition, that our metric yields higher alignment values than a competing metric formulation, and that enforcing this alignment increases the robustness of models.
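To make the proposed definition concrete, the following toy sketch measures how well an input gradient aligns with the direction toward the closest point of another class. It rests on several assumptions that are not taken from the abstract: alignment is scored with cosine similarity, the support of each class is approximated by a small finite set of labeled points (a brute-force nearest-neighbor search stands in for the GAN-generated residual), and the model is a linear classifier whose logit gradient with respect to the input is simply its weight vector. The function names are hypothetical.

```python
import math


def closest_other_class_direction(x, label, samples):
    """Vector from x to the nearest sample whose label differs from `label`.

    Assumption: `samples` is a finite list of (point, label) pairs standing in
    for the support of each class.
    """
    best_point = None
    best_dist = math.inf
    for point, point_label in samples:
        if point_label == label:
            continue  # only points of other classes are candidates
        dist = math.dist(x, point)
        if dist < best_dist:
            best_dist = dist
            best_point = point
    if best_point is None:
        raise ValueError("no sample from another class")
    return [p - a for a, p in zip(x, best_point)]


def cosine_alignment(u, v):
    """Cosine similarity between u and v (assumed alignment score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # alignment is undefined for a zero vector; report 0
    return dot / (norm_u * norm_v)


def linear_logit_gradient(weights):
    """Gradient of the logit w.x + b with respect to x, which is w itself."""
    return list(weights)


# Toy data: x belongs to class 0; the nearest class-1 point is (2, 0).
x = [0.0, 0.0]
samples = [([2.0, 0.0], 1), ([0.0, 5.0], 1), ([-1.0, 0.0], 0)]
direction = closest_other_class_direction(x, 0, samples)

aligned = cosine_alignment(linear_logit_gradient([1.0, 0.0]), direction)
oblique = cosine_alignment(linear_logit_gradient([1.0, 1.0]), direction)
```

In this toy setting, a model whose gradient points straight at the nearest class-1 point scores higher than one whose gradient is oblique to it, which is the kind of comparison the abstract describes between PGD-trained and baseline models.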