Recent studies indicate that Vision Transformers (ViTs) are robust in out-of-distribution scenarios. In particular, the Fully Attentional Network (FAN), a family of ViT backbones, has achieved state-of-the-art robustness. In this paper, we revisit the FAN models and improve their pre-training with a self-emerging token labeling (STL) framework. Our method is a two-stage training framework. We first train a FAN token labeler (FAN-TL) to generate semantically meaningful patch token labels, and then train a FAN student model using both the token labels and the original class label. With the proposed STL framework, our best model, based on FAN-L-Hybrid (77.3M parameters), achieves 84.8% Top-1 accuracy on ImageNet-1K and 42.1% mCE on ImageNet-C, and sets a new state of the art on ImageNet-A (46.1%) and ImageNet-R (56.6%) without using extra data, outperforming the original FAN counterpart by significant margins. The proposed framework also significantly improves performance on downstream tasks such as semantic segmentation, with up to 1.7% improvement in robustness over the counterpart model. Code is available at https://github.com/NVlabs/STL.
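To illustrate the second training stage described above, the following is a minimal sketch of a student objective that combines the image-level class label with per-patch token labels. It is not taken from the released code: the function names, the use of softmax outputs of the token labeler as soft targets, the averaging over tokens, and the weight `alpha` are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_probs):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)


def one_hot(index, num_classes):
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def stl_student_loss(class_logits, class_label, token_logits, token_labels, alpha=1.0):
    """Hypothetical student loss: class-label term plus weighted token-label term.

    class_logits: student logits for the whole image.
    class_label:  ground-truth class index.
    token_logits: one list of student logits per patch token.
    token_labels: one soft label (probability list) per patch token,
                  assumed here to come from the stage-one token labeler.
    alpha:        assumed weight on the token-level term.
    """
    cls_loss = cross_entropy(class_logits, one_hot(class_label, len(class_logits)))
    tok_loss = sum(
        cross_entropy(z, t) for z, t in zip(token_logits, token_labels)
    ) / len(token_logits)
    return cls_loss + alpha * tok_loss


# Toy example: 3 classes, 2 patch tokens.
labeler_token_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]  # stage-one outputs (made up)
token_labels = [softmax(z) for z in labeler_token_logits]   # soft token labels
student_class_logits = [1.2, 0.3, -0.5]
student_token_logits = [[1.0, 0.2, 0.0], [0.1, 0.9, 0.4]]

loss = stl_student_loss(student_class_logits, 0, student_token_logits, token_labels)
print(f"combined loss: {loss:.4f}")
```

In this sketch, setting `alpha=0` reduces the objective to ordinary class-label training, which makes the contribution of the token labels easy to isolate in a toy experiment.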