Adversarial training (AT) is the de facto method for building robust neural networks, but it can be computationally expensive. Fast single-step attacks reduce this cost, but they may lead to catastrophic overfitting (CO). CO occurs when a network gains non-trivial robustness during the early stages of AT but then reaches a breaking point and becomes vulnerable within just a few iterations. The mechanisms behind this failure mode are still poorly understood. In this work, we study the onset of CO in single-step AT methods through controlled modifications of typical datasets of natural images. In particular, we show that CO can be induced at much smaller $\epsilon$ values than previously observed, simply by injecting seemingly innocuous features into the images. These features aid non-robust classification but are not enough to achieve robustness on their own. Through extensive experiments, we analyze this novel phenomenon and discover that the presence of these easy features induces a learning shortcut that leads to CO. Our findings provide new insights into the mechanisms of CO and improve our understanding of the dynamics of AT. The code to reproduce our experiments can be found at https://github.com/gortizji/co_features.
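To make the notion of a single-step attack concrete, the sketch below shows one common instance, the fast gradient sign method (FGSM), applied to a toy linear classifier. The choice of FGSM, the logistic model, and every function name here are assumptions made for illustration; none of it is taken from the linked repository. The attack moves each input coordinate by exactly $\epsilon$ in the direction that increases the loss, so the perturbation stays within an $\ell_\infty$ ball of radius $\epsilon$.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_input_grad(w, b, x, y):
    """Binary cross-entropy of a linear classifier and its gradient w.r.t. x.

    Hypothetical toy model: p = sigmoid(w . x + b), label y in {0, 1}.
    For this model the input gradient has the closed form (p - y) * w.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    grad_x = [(p - y) * wi for wi in w]
    return loss, grad_x


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """Single-step attack: x_adv = x + eps * sign(grad_x loss)."""
    _, grad_x = loss_and_input_grad(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad_x)]


# Illustrative values only; they are not from the paper.
w, b = [2.0, -1.0, 0.5], 0.0
x, y, eps = [0.3, 0.1, -0.2], 1, 0.1

x_adv = fgsm(w, b, x, y, eps)
clean_loss, _ = loss_and_input_grad(w, b, x, y)
adv_loss, _ = loss_and_input_grad(w, b, x_adv, y)
print("clean loss:", clean_loss)
print("adversarial loss:", adv_loss)
```

In single-step AT, a perturbation of this kind would be recomputed for every training batch and the model updated on the perturbed inputs, which is what makes the procedure cheap compared with multi-step attacks.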