There has been significant progress in Masked Image Modeling (MIM). Existing MIM methods can be broadly categorized into two groups based on the reconstruction target: pixel-based and tokenizer-based approaches. The former offers a simpler pipeline and lower computational cost, but it is known to be biased toward high-frequency details. In this paper, we provide a set of empirical studies to confirm this limitation of pixel-based MIM and propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction. By incorporating this design into our base method, MAE, we reduce the wasted modeling capability of pixel-based MIM, improving its convergence and achieving non-trivial improvements across various downstream tasks. To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures like the standard Vision Transformer (ViT). Notably, when applied to a smaller model (e.g., ViT-S), our method yields significant performance gains, such as 1.2% on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation. Code and models are available at https://github.com/open-mmlab/mmpretrain.
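As an illustration of the multi-level feature fusion idea described in the abstract, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function name `fuse_layers`, the choice of a softmax-weighted average, and the per-layer scalar `logits` are all assumptions made for this example. The abstract only states that low-level features from shallow layers are used to aid pixel reconstruction. The sketch relies on one property the abstract does name: in an isotropic architecture such as ViT, every layer outputs the same number of tokens with the same dimension, so features from different depths can be combined without resizing.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_outputs, selected, logits):
    """Fuse features from selected encoder layers (hypothetical helper).

    layer_outputs: list with one entry per encoder layer, each shaped
                   [num_tokens][dim]; all layers share this shape because
                   the architecture is assumed to be isotropic.
    selected:      indices of the layers to fuse, e.g. a few shallow
                   layers plus the last one (an assumption of this sketch).
    logits:        one scalar per selected layer; the softmax of these
                   gives the fusion weights (assumed parameterization).
    """
    if len(selected) != len(logits):
        raise ValueError("need exactly one logit per selected layer")

    weights = softmax(logits)
    num_tokens = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])

    # Accumulate the weighted sum token by token, channel by channel.
    fused = [[0.0] * dim for _ in range(num_tokens)]
    for weight, index in zip(weights, selected):
        features = layer_outputs[index]
        for t in range(num_tokens):
            for d in range(dim):
                fused[t][d] += weight * features[t][d]
    return fused


# Toy example: 3 layers, 2 tokens, 2 channels.
toy_layers = [
    [[1.0, 2.0], [3.0, 4.0]],  # shallow layer
    [[3.0, 4.0], [5.0, 6.0]],  # middle layer
    [[5.0, 6.0], [7.0, 8.0]],  # last layer
]
# Equal logits give equal weights, so this averages the first and last layers.
fused_features = fuse_layers(toy_layers, selected=[0, 2], logits=[0.0, 0.0])
```

In a pre-training setup along these lines, the fused features would replace the last-layer output as the input to the reconstruction decoder, giving the pixel target direct access to shallow-layer information. How the layers are selected and weighted here is a design choice of this sketch only.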