Contrastive Masked Autoencoders are Stronger Vision Learners

Masked image modeling (MIM) has achieved promising results on various vision tasks. However, the limited discriminability of the learned representations indicates that there is still considerable room for building a stronger vision learner. Towards this goal, we propose Contrastive Masked Autoencoders (CMAE), a new self-supervised pre-training method for learning more comprehensive and capable vision representations. By carefully unifying contrastive learning (CL) and masked image modeling (MIM) through novel designs, CMAE leverages their respective advantages and learns representations with both strong instance discriminability and local perceptibility. Specifically, CMAE consists of two branches: the online branch is an asymmetric encoder-decoder, and the momentum branch is a momentum-updated encoder. During training, the online encoder reconstructs original images from latent representations of masked images to learn holistic features. The momentum encoder, fed with the full images, enhances feature discriminability via contrastive learning with its online counterpart. To make CL compatible with MIM, CMAE introduces two new components, i.e., pixel shifting for generating plausible positive views and a feature decoder for complementing the features of contrastive pairs. Thanks to these novel designs, CMAE effectively improves representation quality and transfer performance over its MIM counterpart. CMAE achieves state-of-the-art performance on highly competitive benchmarks for image classification, semantic segmentation and object detection. Notably, CMAE-Base achieves $85.3\%$ top-1 accuracy on ImageNet and $52.5\%$ mIoU on ADE20k, surpassing the previous best results by $0.7\%$ and $1.8\%$, respectively. The source code is publicly accessible at \url{https://github.com/ZhichengHuang/CMAE}.
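
The three mechanisms named above (a momentum-updated encoder, pixel-shifted positive views, and a contrastive objective between the two branches) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the momentum coefficient `mu=0.996`, the temperature `tau=0.07`, and the choice of an InfoNCE-style loss are all assumptions made for illustration, and the masking, reconstruction, and feature decoder of the online branch are omitted.

```python
import math
import random


def ema_update(momentum_params, online_params, mu=0.996):
    """Momentum (EMA) update: theta_m <- mu * theta_m + (1 - mu) * theta_o.

    mu is an assumed value, not taken from the abstract.
    """
    return [mu * m + (1.0 - mu) * o for m, o in zip(momentum_params, online_params)]


def pixel_shift_views(image, crop, max_shift, rng):
    """Cut two crop x crop views whose top-left corners differ by at most
    max_shift pixels, so the two views overlap heavily (hypothetical helper)."""
    height, width = len(image), len(image[0])
    # Base corner, chosen so that a shifted crop still fits inside the image.
    top = rng.randint(0, height - crop - max_shift)
    left = rng.randint(0, width - crop - max_shift)
    dy = rng.randint(0, max_shift)
    dx = rng.randint(0, max_shift)

    def cut(r, c):
        return [row[c:c + crop] for row in image[r:r + crop]]

    return cut(top, left), cut(top + dy, left + dx), (dy, dx)


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(online_feats, momentum_feats, tau=0.07):
    """InfoNCE-style loss (assumed): the i-th online feature should match the
    i-th momentum feature and differ from all other momentum features."""
    q = [l2_normalize(v) for v in online_feats]
    k = [l2_normalize(v) for v in momentum_feats]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / tau for kj in k]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]
    return total / len(q)
```

In a training loop under these assumptions, one view from `pixel_shift_views` would be masked and passed to the online branch, the other would go unmasked to the momentum encoder, `info_nce` would compare the resulting features, and `ema_update` would run after each optimizer step.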
