While adversarial training has been studied extensively for ResNet architectures and low-resolution datasets such as CIFAR, much less is known about it on ImageNet. Given the recent debate about whether transformers are more robust than convnets, we revisit adversarial training on ImageNet, comparing ViTs and ConvNeXts. Extensive experiments show that minor changes to the architecture, most notably replacing PatchStem with ConvStem, and to the training scheme have a significant impact on the achieved robustness. These changes not only increase robustness in the seen $\ell_\infty$-threat model but improve generalization to unseen $\ell_1/\ell_2$-attacks even more. Our modified ConvNeXt, ConvNeXt + ConvStem, yields the most robust $\ell_\infty$-models across different ranges of model parameters and FLOPs, while our ViT + ConvStem yields the best generalization to unseen threat models.
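For reference, adversarial training in the seen $\ell_\infty$-threat model is usually written as the min-max problem below. This is the standard textbook formulation rather than something stated in the abstract, and the model $f_\theta$, loss $\mathcal{L}$, data distribution $\mathcal{D}$, and radius $\epsilon_\infty$ are notation assumed here for illustration:

$$
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \max_{\|\delta\|_\infty \le \epsilon_\infty} \mathcal{L}\big(f_\theta(x+\delta),\, y\big) \right]
$$

Under this reading, generalization to an unseen threat model means evaluating the same trained $f_\theta$ against perturbations constrained by $\|\delta\|_p \le \epsilon_p$ for $p \in \{1, 2\}$, norms that the inner maximization never used during training.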