Deep neural networks are vulnerable to adversarial noise. Adversarial Training (AT) has been demonstrated to be the most effective defense strategy for protecting neural networks from being fooled. However, we find that AT neglects to learn robust features, resulting in poor adversarial robustness. To address this issue, we highlight two criteria for robust representation: (1) Exclusion: \emph{the features of examples stay far from those of other classes}; (2) Alignment: \emph{the features of natural examples and their corresponding adversarial examples are close to each other}. These criteria motivate us to propose a generic AT framework that gains robust representation through asymmetric negative contrast and reverse attention. Specifically, we design an asymmetric negative contrast based on predicted probabilities to push apart examples of different classes in the feature space. Moreover, we propose to weight features by the parameters of the linear classifier as reverse attention, to obtain class-aware features and pull features of the same class closer together. Empirical evaluations on three benchmark datasets show that our methods greatly improve the robustness of AT and achieve state-of-the-art performance.
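As a rough illustration of the reverse attention idea described above, the following minimal sketch reweights a feature vector element-wise by the row of the linear classifier's weight matrix that belongs to a chosen class. The function names `reverse_attention` and `linear_logits`, the toy numbers, and the choice to pass the class index as an argument are assumptions made for illustration only; the abstract does not specify how the class is selected or how the weighted feature is used downstream.

```python
from typing import List, Sequence


def reverse_attention(
    feature: Sequence[float],
    classifier_weights: Sequence[Sequence[float]],
    class_index: int,
) -> List[float]:
    """Reweight a feature vector by one class's linear-classifier weights.

    Hypothetical helper: classifier_weights[c] is assumed to hold the weight
    row of class c, with the same length as the feature vector.
    """
    weights = classifier_weights[class_index]
    if len(weights) != len(feature):
        raise ValueError("feature and weight row must have the same length")
    # Element-wise product: channels the classifier relies on for this class
    # are amplified, the others are suppressed or sign-flipped.
    return [f * w for f, w in zip(feature, weights)]


def linear_logits(
    feature: Sequence[float],
    classifier_weights: Sequence[Sequence[float]],
) -> List[float]:
    """Bias-free logits of a linear classifier, one per class."""
    return [sum(f * w for f, w in zip(feature, row)) for row in classifier_weights]


# Toy example (made-up numbers): 3-dimensional feature, 2 classes.
feature = [1.0, 2.0, 3.0]
W = [
    [0.5, 0.0, -1.0],  # weight row of class 0
    [1.0, 1.0, 1.0],   # weight row of class 1
]

class_aware = reverse_attention(feature, W, class_index=0)
print(class_aware)            # [0.5, 0.0, -3.0]
print(linear_logits(feature, W))  # [-2.5, 6.0]
```

One property worth noting in this sketch: the entries of the reweighted feature sum to the bias-free logit of the chosen class, so the weighting makes each channel's contribution to that class explicit. This is one plausible reading of "class-aware feature", not a statement of the authors' exact implementation.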