Fast Adversarial Training (FAT) has attracted growing attention in the research community because of its efficacy in improving adversarial robustness. A notable challenge in this field is catastrophic overfitting (CO). Although existing FAT approaches have made progress in mitigating CO, the gain in adversarial robustness comes with a non-negligible drop in classification accuracy on clean samples. To tackle this issue, we first use the feature activation differences between clean and adversarial examples to analyze the underlying causes of CO. Intriguingly, our findings reveal that CO can be attributed to the feature coverage induced by a few specific pathways. By intentionally manipulating the feature activation differences in these pathways with well-designed regularization terms, we can both mitigate and induce CO, which provides further evidence for this observation. Notably, models trained stably with these terms outperform prior FAT work. On this basis, we harness CO to achieve "attack obfuscation", aiming to bolster model performance. Consequently, models suffering from CO can attain optimal classification accuracy on both clean and adversarial data when random noise is added to their inputs during evaluation. We also validate their robustness against transferred adversarial examples and the necessity of inducing CO to improve robustness. Hence, CO may not be a problem that has to be solved.
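To make the quantities above concrete, the following is a minimal sketch, not the authors' implementation, of how a per-channel feature activation difference between clean and adversarial examples could be measured, how the few channels with the largest difference could be selected, and how random noise could be added to inputs at evaluation time. All names (`activation_gap`, `top_channels`, `gap_penalty`, `add_uniform_noise`) are hypothetical. The squared-gap penalty, the uniform noise distribution, and the [0, 1] input range are assumptions made only for illustration; the abstract does not specify the form of the regularization terms or of the noise.

```python
import random


def activation_gap(clean_feats, adv_feats):
    """Mean absolute clean-vs-adversarial activation difference per channel.

    Each argument is a list of samples; each sample is a list holding one
    activation value per channel (e.g. a spatially pooled feature map).
    """
    n_channels = len(clean_feats[0])
    gaps = [0.0] * n_channels
    for clean, adv in zip(clean_feats, adv_feats):
        for c in range(n_channels):
            gaps[c] += abs(clean[c] - adv[c])
    return [g / len(clean_feats) for g in gaps]


def top_channels(gaps, k):
    """Indices of the k channels with the largest activation gap."""
    order = sorted(range(len(gaps)), key=lambda c: gaps[c], reverse=True)
    return order[:k]


def gap_penalty(gaps, channels):
    """Assumed regularizer: sum of squared gaps over the selected channels.

    The weight (and sign) with which this enters the training loss is left
    to the caller; the source does not give the actual terms.
    """
    return sum(gaps[c] ** 2 for c in channels)


def add_uniform_noise(x, magnitude, rng):
    """Add uniform noise in [-magnitude, magnitude], then clip to [0, 1]."""
    return [min(1.0, max(0.0, v + rng.uniform(-magnitude, magnitude))) for v in x]


# Toy usage with two samples and three channels.
clean = [[1.0, 0.0, 2.0], [1.0, 0.0, 2.0]]
adv = [[1.0, 0.5, 5.0], [1.0, 1.5, 3.0]]
gaps = activation_gap(clean, adv)
selected = top_channels(gaps, k=1)
penalty = gap_penalty(gaps, selected)
noisy = add_uniform_noise([0.0, 0.5, 1.0], magnitude=0.1, rng=random.Random(0))
```

In a real training loop the activations would come from an intermediate layer of the network, and the penalty would be computed on differentiable tensors rather than Python lists; the list-based version here only illustrates the bookkeeping.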