Deeper Vision Transformers (ViTs) are more challenging to train. We expose a degradation problem in the deeper layers of ViTs pre-trained with masked image modeling (MIM). To ease the training of deeper ViTs, we introduce a self-supervised learning framework called Masked Image Residual Learning (MIRL), which significantly alleviates the degradation problem and makes scaling ViTs along depth a promising direction for improving performance. We reformulate the pre-training objective for the deeper layers of a ViT as learning to recover the residual of the masked image. We provide extensive empirical evidence that deeper ViTs can be effectively optimized with MIRL and readily gain accuracy from increased depth. At the same level of computational complexity as ViT-Base and ViT-Large, we instantiate 4.5$\times$ and 2$\times$ deeper ViTs, dubbed ViT-S-54 and ViT-B-48, respectively. ViT-S-54, at one-third the computational cost of ViT-Large, achieves performance on par with ViT-Large. ViT-B-48 achieves 86.2% top-1 accuracy on ImageNet. Deeper ViTs pre-trained with MIRL generalize well to downstream tasks such as object detection and semantic segmentation. MIRL is also efficient to pre-train: with less pre-training time, it yields performance competitive with other approaches.
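To make the phrase "recover the residual of the masked image" concrete, the following is a minimal sketch of one possible reading, not the authors' formulation. It assumes that shallower layers produce a coarse reconstruction of the masked pixels and that deeper layers are asked to predict what that reconstruction leaves over. The function names, the squared-error loss, and the toy pixel values are all hypothetical and chosen only for illustration.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def residual_target(masked_pixels, main_pred):
    """What remains of the masked pixels after the coarse reconstruction.

    Assumption: `main_pred` comes from the shallower layers.
    """
    return [x - m for x, m in zip(masked_pixels, main_pred)]


def residual_learning_loss(masked_pixels, main_pred, residual_pred):
    """Hypothetical objective: main_pred + residual_pred should match the pixels.

    Assumption: `residual_pred` comes from the deeper layers.
    """
    recon = [m + r for m, r in zip(main_pred, residual_pred)]
    return mse(recon, masked_pixels)


# Toy values standing in for the pixels of one masked patch (illustrative only).
masked_pixels = [0.8, 0.2, 0.5, 0.9]
main_pred = [0.7, 0.3, 0.5, 0.6]  # coarse estimate from shallower layers

# If the deeper layers predicted the residual exactly, the loss would vanish
# (up to floating-point rounding).
target = residual_target(masked_pixels, main_pred)
perfect_loss = residual_learning_loss(masked_pixels, main_pred, target)

# If the deeper layers contributed nothing, the loss would equal the error of
# the coarse reconstruction alone.
zero_res_loss = residual_learning_loss(masked_pixels, main_pred, [0.0] * 4)
```

Under this reading, the deeper layers only have to model the part of the image that the shallower layers miss, instead of reconstructing the masked image from scratch.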