Inspired by masked language modeling (MLM) in natural language processing, masked image modeling (MIM) has been recognized as a strong self-supervised pre-training method in computer vision. However, the high random mask ratio of MIM causes two serious problems: 1) images are inadequately utilized within each iteration, which prolongs pre-training, and 2) predictions are highly inconsistent, which makes generation unreliable, i.e., the prediction for an identical patch may differ across mask rounds, leading to divergent semantics in the final generated outputs. To tackle these problems, we propose efficient masked autoencoders with self-consistency (EMAE), which improve the pre-training efficiency and increase the consistency of MIM. In particular, we present a parallel mask strategy that divides the image into K non-overlapping parts, each of which is generated by a random mask with the same mask ratio. The MIM task is then conducted in parallel on all parts within an iteration, and the model minimizes the loss between the predictions and the masked patches. In addition, we design self-consistency learning to further maintain the consistency of predictions for overlapping masked patches among parts. Overall, our method exploits the data more efficiently and obtains reliable representations. Experiments on ImageNet show that EMAE achieves the best performance on ViT-Large with only 13% of the MAE pre-training time using NVIDIA A100 GPUs. After pre-training on diverse datasets, EMAE consistently obtains state-of-the-art transfer ability on a variety of downstream tasks, such as image classification, object detection, and semantic segmentation.
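The parallel mask strategy and the self-consistency objective can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions that the abstract does not confirm: each of the K parts is treated as the visible patch set of one mask (so the mask ratio is 1 - 1/K), the reconstruction loss is a mean squared error, and the consistency term is a mean absolute difference between the predictions of two parts on patches masked in both. The function names `parallel_masks`, `reconstruction_loss`, and `self_consistency_loss` are hypothetical.

```python
import numpy as np


def parallel_masks(num_patches, k, rng):
    """Return a (k, num_patches) boolean array, True = masked.

    Assumption: a random permutation of the patch indices is split into k
    disjoint groups, and group i is the visible set of part i.
    """
    if num_patches % k != 0:
        raise ValueError("num_patches must be divisible by k")
    parts = rng.permutation(num_patches).reshape(k, num_patches // k)
    masks = np.ones((k, num_patches), dtype=bool)
    for i, visible in enumerate(parts):
        masks[i, visible] = False  # patches of part i stay visible in part i
    return masks


def reconstruction_loss(preds, targets, masks):
    """Mean squared error over the masked patches of all parts.

    preds: (k, N, D) per-part predictions, targets: (N, D), masks: (k, N).
    """
    per_patch = ((preds - targets[None]) ** 2).mean(axis=-1)  # (k, N)
    return float((per_patch * masks).sum() / masks.sum())


def self_consistency_loss(preds, masks):
    """Mean absolute disagreement between pairs of parts, measured only on
    patches that are masked in both parts (assumed form of the loss)."""
    k = masks.shape[0]
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            overlap = masks[i] & masks[j]
            if overlap.any():
                total += float(np.abs(preds[i, overlap] - preds[j, overlap]).mean())
                pairs += 1
    return total / max(pairs, 1)


# Toy usage with random stand-ins for the decoder outputs.
rng = np.random.default_rng(0)
masks = parallel_masks(num_patches=196, k=4, rng=rng)
targets = rng.normal(size=(196, 8))
preds = rng.normal(size=(4, 196, 8))
loss = reconstruction_loss(preds, targets, masks) + self_consistency_loss(preds, masks)
```

Under these assumptions, every patch is visible in exactly one part, so a single iteration uses the whole image, while any two parts still share masked patches on which their predictions can be compared.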