Masked Autoencoders (MAE) have become a popular paradigm for large-scale visual representation pre-training. However, MAE reconstructs only low-level RGB signals after the decoder and provides no high-level semantic supervision for the encoder, which leads to sub-optimal representations and long pre-training schedules. To alleviate this, previous methods simply replace the pixel reconstruction targets of the 75% masked tokens with features encoded by models pre-trained with image-image (DINO) or image-language (CLIP) contrastive learning. In contrast, we propose Mimic before Reconstruct for Masked Autoencoders, named MR-MAE, which jointly learns high-level and low-level representations without interference during pre-training. For high-level semantics, MR-MAE applies a mimic loss to the 25% visible tokens output by the encoder to capture the pre-trained patterns encoded in CLIP and DINO. For low-level structures, we inherit the reconstruction loss of MAE to predict RGB pixel values for the 75% masked tokens after the decoder. Because MR-MAE applies the high-level and low-level targets to separate partitions, the learning conflicts between them are naturally avoided, yielding superior visual representations for various downstream tasks. On ImageNet-1K, the MR-MAE base model pre-trained for only 400 epochs achieves 85.8% top-1 accuracy after fine-tuning, surpassing the 1600-epoch MAE base model by +2.2% and the previous state-of-the-art BEiT V2 base model by +0.3%. Code and pre-trained models will be released at https://github.com/Alpha-VL/ConvMAE.
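The way the two objectives are assigned to disjoint token sets can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`split_tokens`, `mse`, `mr_mae_loss`), the use of mean squared error for both terms, and the unweighted sum of the two losses are all assumptions made for illustration. Tokens are plain Python lists of floats so the example runs without any deep learning library.

```python
import random


def split_tokens(num_tokens, mask_ratio=0.75, seed=0):
    """Randomly partition token indices into (visible, masked) index lists."""
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)
    num_masked = int(num_tokens * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def mse(pred, target):
    """Mean squared error over two equal-length lists of equal-length vectors."""
    total, count = 0.0, 0
    for p_vec, t_vec in zip(pred, target):
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


def mr_mae_loss(encoder_out, teacher_feats, decoder_out, pixels, visible, masked):
    """Combine the two targets, each on its own partition of tokens.

    encoder_out   : encoder features for the visible tokens only (same order as `visible`)
    teacher_feats : teacher (e.g. CLIP or DINO) features for every token
    decoder_out   : decoder pixel predictions for every token
    pixels        : ground-truth RGB values for every token
    """
    # High-level target: mimic the teacher on the visible tokens, at the encoder output.
    mimic = mse(encoder_out, [teacher_feats[i] for i in visible])
    # Low-level target: reconstruct pixels on the masked tokens, at the decoder output.
    recon = mse([decoder_out[i] for i in masked], [pixels[i] for i in masked])
    # Assumption: the two terms are summed with equal weight.
    return mimic + recon, mimic, recon


# Usage: 196 tokens (assumed here: a 14 x 14 patch grid) with a 75% mask ratio.
visible, masked = split_tokens(196)
print(len(visible), len(masked))  # 49 147
```

Because `visible` and `masked` never overlap, no token receives both a feature target and a pixel target, which is the separation the abstract credits for avoiding conflicts between the two objectives.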