Masked image modeling (MIM), which requires the target model to recover the masked part of the input image, has recently received much attention in self-supervised learning (SSL). Although MIM-based pre-training methods achieve new state-of-the-art performance when transferred to many downstream tasks, visualizations show that the representations they learn are less separable than those learned by contrastive pre-training. This leads us to ask whether the linear separability of MIM pre-trained representations can be improved, and whether doing so improves pre-training performance. Because MIM and contrastive learning tend to use different data augmentations and training strategies, combining these two pretext tasks is not trivial. In this work, we propose a novel and flexible pre-training framework, named MimCo, which combines MIM and contrastive learning through two-stage pre-training. Specifically, MimCo takes a pre-trained contrastive learning model as the teacher model and is pre-trained with two types of learning targets: patch-level and image-level reconstruction losses. Extensive transfer experiments on downstream tasks demonstrate the superior performance of the MimCo pre-training framework. Taking ViT-S as an example, when using the pre-trained MoCov3-ViT-S as the teacher model, MimCo needs only 100 epochs of pre-training to achieve 82.53% top-1 fine-tuning accuracy on ImageNet-1K, which outperforms state-of-the-art self-supervised learning counterparts.
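As a rough illustration of the two learning targets, the sketch below combines a patch-level term and an image-level term computed against the features of a frozen teacher. The abstract does not give the form of either loss, so several things here are assumptions: both terms use an InfoNCE-style objective, the image-level feature is a mean over patch features, and the names `info_nce`, `mimco_style_loss`, `w_patch`, `w_image`, and `temperature` are hypothetical. Plain NumPy arrays stand in for the ViT student and the contrastive teacher.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def info_nce(queries, keys, temperature=0.2):
    """InfoNCE-style loss: query i should match key i; other keys are negatives.

    queries, keys: arrays of shape (M, D).
    """
    q = l2_normalize(queries)
    k = l2_normalize(keys)
    logits = q @ k.T / temperature                        # (M, M) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def mimco_style_loss(student_patches, teacher_patches, mask,
                     w_patch=1.0, w_image=1.0):
    """Assumed combination of patch-level and image-level targets.

    student_patches: (B, N, D) student features computed from the masked image.
    teacher_patches: (B, N, D) frozen-teacher features from the unmasked image.
    mask:            (B, N) boolean, True where a patch was masked.
    """
    # Patch-level term: each masked patch should match the teacher's feature
    # at the same position, against all other masked patches in the batch.
    patch_loss = info_nce(student_patches[mask], teacher_patches[mask])

    # Image-level term: mean-pooled student feature should match the
    # mean-pooled teacher feature of the same image (pooling is an assumption).
    image_loss = info_nce(student_patches.mean(axis=1),
                          teacher_patches.mean(axis=1))

    total = w_patch * patch_loss + w_image * image_loss
    return total, patch_loss, image_loss
```

In a two-stage setup of this kind, the teacher would be trained first with a contrastive method and then kept fixed, so only the student receives gradients from the combined loss.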