Self-supervised pre-training has achieved promising downstream performance for large-scale Vision Transformers (ViTs). Yet how much these pre-training paradigms benefit lightweight ViTs is considerably less studied. In this work, we develop and benchmark several self-supervised pre-training methods on image classification and on some downstream dense prediction tasks. Surprisingly, we find that with proper pre-training, even vanilla lightweight ViTs perform comparably to previous SOTA networks with delicate architecture design. This contradicts the recently popular conception that vanilla ViTs are not suitable for vision tasks in the lightweight regime. We also point out some defects of such pre-training, e.g., it fails to benefit from large-scale pre-training data and shows inferior performance on data-insufficient downstream tasks. Furthermore, we show the effect of such pre-training by analyzing the properties of the layer representations and attention maps of related models. Finally, based on these analyses, we develop a distillation strategy applied during pre-training, which further improves downstream performance for MAE-based pre-training. Code is available at https://github.com/wangsr126/mae-lite.