While adversarial training has been extensively studied for ResNet architectures and low-resolution datasets like CIFAR, much less is known about it on ImageNet. Given the recent debate about whether transformers are more robust than convnets, we revisit adversarial training on ImageNet, comparing ViTs and ConvNeXts. Extensive experiments show that minor changes in architecture, most notably replacing PatchStem with ConvStem, and in the training scheme have a significant impact on the achieved robustness. These changes not only increase robustness in the seen $\ell_\infty$ threat model but improve generalization to the unseen $\ell_1/\ell_2$ threat models even more. Our modified ConvNeXt, ConvNeXt + ConvStem, yields the most robust models across different ranges of model parameters and FLOPs.
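For reference, the seen and unseen threat models can be written in the standard min-max form of adversarial training. This is a sketch under the usual formulation, not a statement of the exact objective used in the experiments: the symbols $f_\theta$ (model with parameters $\theta$), $\mathcal{L}$ (training loss), $\mathcal{D}$ (data distribution), $\delta$ (perturbation), and $\epsilon_p$ (perturbation radius) are notation assumed here for illustration.

$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \max_{\|\delta\|_\infty \le \epsilon_\infty} \mathcal{L}\big(f_\theta(x+\delta),\, y\big) \right]$$

Under this reading, training only ever sees perturbations from the $\ell_\infty$ ball, while generalization to unseen threat models is measured against perturbations satisfying $\|\delta\|_p \le \epsilon_p$ for $p \in \{1, 2\}$.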