Observation, Analysis, and Solution: Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training

Masked image modeling (MIM) pre-training of large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we ask whether the fine-tuning performance of extremely simple, small-scale ViTs can also benefit from this pre-training paradigm, a question that has received far less attention than the well-established practice of designing lightweight architectures with sophisticated components. By carefully adapting various typical MIM pre-training methods to this lightweight regime and comparing them with contrastive learning (CL) pre-training on various downstream image classification and dense prediction tasks, we systematically observe that MIM and CL behave differently with respect to the scale of the downstream fine-tuning data. Furthermore, we analyze the frozen features under linear probing evaluation, as well as the layer representation similarities and attention maps across the obtained models. These analyses clearly show that MIM pre-training learns the higher layers poorly, which leads to unsatisfactory fine-tuning performance on data-insufficient downstream tasks. This finding naturally guides the choice of appropriate distillation strategies during pre-training to address this deterioration. Extensive experiments on various vision tasks demonstrate the effectiveness of our observation-analysis-solution flow. In particular, our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7M/6.5M parameters) achieves 79.4%/78.9% top-1 accuracy on ImageNet-1K. It also enables SOTA performance on the ADE20K semantic segmentation task (42.8% mIoU) and the LaSOT visual tracking task (66.1% AUC) in the lightweight regime. The latter even surpasses all current SOTA lightweight CPU-realtime trackers.
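For readers unfamiliar with MIM, the masking step it relies on can be sketched as follows. This is a minimal illustration of uniform random patch masking, not the paper's implementation: the function name `random_patch_mask`, the 75% mask ratio, and the 14 x 14 patch grid are all assumptions chosen for the example, since the abstract does not specify them.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into visible and masked sets, uniformly at random.

    Hypothetical helper for illustration; the mask ratio is an assumed value.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)

    # Shuffle all patch indices, then take the first num_masked as masked.
    order = list(range(num_patches))
    rng.shuffle(order)
    masked_ids = sorted(order[:num_masked])
    visible_ids = sorted(order[num_masked:])
    return visible_ids, masked_ids


# Assumed setting: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
visible, masked = random_patch_mask(196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

In a typical MIM setup, the encoder sees only (or mostly) the visible patches and the model is trained to reconstruct a target for the masked ones; the choice of target and decoder varies between methods.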
