Although fast adversarial training is an efficient way to build robust networks, it can suffer from a serious problem known as catastrophic overfitting (CO), in which multi-step robust accuracy suddenly collapses to zero. In this paper, we decouple, for the first time, single-step adversarial examples into data-information and self-information, which reveals a phenomenon we call "self-fitting". Self-fitting, in which the network learns the self-information embedded in single-step perturbations, naturally leads to CO. When self-fitting occurs, the network exhibits a pronounced "channel differentiation" phenomenon: some convolution channels responsible for recognizing self-information become dominant, while those responsible for data-information are suppressed. As a result, the network can recognize only images that carry sufficient self-information and loses its ability to generalize to other types of data. Based on self-fitting, we provide new insights into existing methods for mitigating CO and extend CO to multi-step adversarial training. Our findings reveal a self-learning mechanism in adversarial training and open up new perspectives for suppressing different kinds of information to mitigate CO.