Adversarial training (AT) is currently one of the most effective ways to make deep neural networks robust to adversarial attacks. However, most AT methods suffer from robust overfitting, i.e., a significant generalization gap in adversarial robustness between the training and test curves. In this paper, we first identify a connection between robust overfitting and the excessive memorization of noisy labels in AT, viewed through the gradient norm. Because this label noise is mainly caused by distribution mismatch and improper label assignment, we propose a label refinement approach for AT. Specifically, our Self-Guided Label Refinement first self-refines a more accurate and informative label distribution from over-confident hard labels. It then calibrates training by dynamically incorporating knowledge from self-distilled models into the current model, and thus requires no external teacher. Empirical results demonstrate that our method simultaneously improves standard accuracy and robust performance across multiple benchmark datasets, attack types, and architectures. We also provide a set of information-theoretic analyses that examine our method and suggest the importance of soft labels for robust generalization.
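The two steps named in the abstract, refining over-confident hard labels into a softer distribution and drawing guidance from a self-distilled model rather than an external teacher, can be illustrated with a minimal NumPy sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the linear interpolation rule, the exponential-moving-average (EMA) teacher, and the names `refine_labels`, `ema_update`, `soft_cross_entropy`, `mix`, `temperature`, and `decay` are hypothetical and may differ from the paper's actual method. Adversarial example generation is omitted.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax with an optional temperature (numerically stabilized)."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def refine_labels(hard_labels, teacher_logits, num_classes, mix=0.5, temperature=2.0):
    """Blend one-hot labels with the self-teacher's softened predictions.

    Assumption: a plain linear interpolation; the paper's rule may differ.
    """
    hard = np.eye(num_classes)[hard_labels]         # one-hot hard labels
    teacher = softmax(teacher_logits, temperature)  # softened teacher beliefs
    return mix * hard + (1.0 - mix) * teacher       # each row still sums to 1


def ema_update(teacher_weights, student_weights, decay=0.99):
    """Assumed self-distillation step: the teacher is an EMA of the student."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]


def soft_cross_entropy(student_logits, soft_targets):
    """Mean cross-entropy of student predictions against soft targets."""
    log_p = np.log(softmax(student_logits) + 1e-12)
    return float(-(soft_targets * log_p).sum(axis=1).mean())
```

In a training loop under these assumptions, `teacher_logits` would come from the EMA model evaluated on the same (adversarial) batch, the student would minimize `soft_cross_entropy` against the refined labels, and `ema_update` would be called after each optimizer step. With `mix >= 0.5` the refined label keeps the original class as its most probable class while spreading the remaining mass according to the teacher.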