Deeper Vision Transformers (ViTs) are more challenging to train. We expose a degradation problem in the deeper layers of ViTs pre-trained with masked image modeling (MIM). To ease the training of deeper ViTs, we introduce a self-supervised learning framework called \textbf{M}asked \textbf{I}mage \textbf{R}esidual \textbf{L}earning (\textbf{MIRL}), which significantly alleviates the degradation problem and makes scaling ViTs along depth a promising direction for improving performance. MIRL reformulates the pre-training objective for the deeper layers of a ViT as learning to recover the residual of the masked image. Extensive empirical evidence shows that deeper ViTs can be effectively optimized with MIRL and readily gain accuracy from increased depth. At the same level of computational complexity as ViT-Base and ViT-Large, we instantiate 4.5{$\times$} and 2{$\times$} deeper ViTs, dubbed ViT-S-54 and ViT-B-48, respectively. ViT-S-54, with 3{$\times$} lower cost than ViT-Large, performs on par with ViT-Large, and ViT-B-48 achieves 86.2\% top-1 accuracy on ImageNet. Deeper ViTs pre-trained with MIRL also generalize well to downstream tasks such as object detection and semantic segmentation. In addition, MIRL is efficient to pre-train: with less pre-training time, it yields performance competitive with other approaches.
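
The residual objective described above can be sketched as follows. This is only an illustrative formalization under assumed notation, not the formulation used by MIRL itself: the symbols $x$ (input image), $x_{\mathrm{vis}}$ (its visible patches), $M$ (binary mask selecting the masked patches), $f_{\mathrm{s}}$ and $f_{\mathrm{d}}$ (shallower and deeper encoder layers), and $g_{\mathrm{s}}$ and $g_{\mathrm{d}}$ (two lightweight decoders) are all hypothetical names introduced here. Suppose the shallower layers produce a first reconstruction of the masked image,
\[
  \hat{x} = g_{\mathrm{s}}\bigl(f_{\mathrm{s}}(x_{\mathrm{vis}})\bigr),
\]
and the deeper layers, applied on top of the shallower ones, predict a correction,
\[
  \hat{r} = g_{\mathrm{d}}\bigl(f_{\mathrm{d}}(f_{\mathrm{s}}(x_{\mathrm{vis}}))\bigr).
\]
A loss of the form
\[
  \mathcal{L} = \bigl\| M \odot \bigl( x - \hat{x} - \hat{r} \bigr) \bigr\|_2^2
\]
is then minimized when $\hat{r}$ matches the residual $x - \hat{x}$ on the masked patches, so in this sketch the deeper layers are trained to recover what the shallower reconstruction leaves unexplained rather than the full image.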