Masked image modeling (MIM), an emerging self-supervised pre-training method, has shown impressive success across numerous downstream vision tasks with Vision Transformers. Its underlying idea is simple: a portion of the input image is masked out and then reconstructed via a pretext task. However, the working principle behind MIM is not well explained, and previous studies maintain that MIM primarily works for the Transformer family and is incompatible with CNNs. In this work, we observe that MIM essentially teaches the model to learn better middle-order interactions among patches for more generalized feature extraction. We then propose an Architecture-Agnostic Masked Image Modeling framework (A$^2$MIM), which is compatible with both Transformers and CNNs in a unified way. Extensive experiments on popular benchmarks show that A$^2$MIM learns better representations without explicit design and endows the backbone model with a stronger capability to transfer to various downstream tasks.
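To make the mask-and-reconstruct idea concrete, the following is a minimal sketch of the generic MIM pretext task, not the A$^2$MIM method itself. The function names, the 2x2 patch size, the 50% mask ratio, the zero fill value, and the choice to score reconstruction only on masked pixels are all illustrative assumptions that do not come from the abstract.

```python
import random

def random_patch_mask(grid_h, grid_w, mask_ratio, seed=0):
    """Return a grid_h x grid_w boolean grid; True marks a masked patch."""
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    return [[(r * grid_w + c) in masked_ids for c in range(grid_w)]
            for r in range(grid_h)]

def apply_mask(image, mask, patch, fill=0.0):
    """Copy `image` (2D list) and overwrite every masked patch with `fill`."""
    return [[fill if mask[y // patch][x // patch] else image[y][x]
             for x in range(len(image[0]))]
            for y in range(len(image))]

def masked_mse(pred, target, mask, patch):
    """Mean squared error over pixels in masked patches only (an assumption)."""
    total, count = 0.0, 0
    for y in range(len(target)):
        for x in range(len(target[0])):
            if mask[y // patch][x // patch]:
                total += (pred[y][x] - target[y][x]) ** 2
                count += 1
    return total / count if count else 0.0

# Toy 8x8 grayscale image split into 2x2 patches (a 4x4 patch grid).
image = [[float(y * 8 + x) for x in range(8)] for y in range(8)]
mask = random_patch_mask(4, 4, mask_ratio=0.5)
corrupted = apply_mask(image, mask, patch=2)
```

In an actual pre-training loop, `corrupted` would be fed to the backbone (a Transformer or a CNN) and `masked_mse` would compare the network's output against `image`; here a perfect prediction gives a loss of zero, while the corrupted input itself gives a positive loss.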