Inspired by masked language modeling (MLM) in natural language processing, masked image modeling (MIM) has become a strong and popular self-supervised pre-training method in computer vision. However, its high random mask ratio causes two serious problems: 1) the data are not exploited efficiently, which makes pre-training slow (e.g., 1600 epochs for MAE vs. 300 epochs for supervised training), and 2) the pre-trained model shows high uncertainty and inconsistency, i.e., the prediction for the same patch may differ across masking rounds. To tackle these problems, we propose efficient masked autoencoders with self-consistency (EMAE), which improve both the pre-training efficiency and the consistency of MIM. In particular, we progressively divide the image into K non-overlapping parts, each generated by a random mask with the same mask ratio. The MIM task is then conducted in parallel on all parts within one iteration, producing a prediction for each part. In addition, we design a self-consistency module that keeps the predictions consistent for masked patches that overlap among parts. Overall, the proposed method exploits the data more efficiently and obtains reliable representations. Experiments on ImageNet show that, with ViT-Base, EMAE achieves higher results with only 300 pre-training epochs than MAE does with 1600 epochs. EMAE also consistently obtains state-of-the-art transfer performance on various downstream tasks, such as object detection and semantic segmentation.
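The partitioning and consistency ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the K parts are disjoint sets of visible patches that together cover the image, so each part has mask ratio 1 - 1/K, and it uses a mean squared difference as a stand-in for the self-consistency objective, whose exact form the abstract does not give. The function names `partition_patches`, `masks_from_parts`, and `consistency_loss` are hypothetical, and random arrays stand in for decoder predictions.

```python
import numpy as np


def partition_patches(num_patches, k, rng):
    """Split patch indices into k disjoint, equally sized visible sets (assumed scheme)."""
    if num_patches % k != 0:
        raise ValueError("num_patches must be divisible by k in this sketch")
    perm = rng.permutation(num_patches)
    return np.split(perm, k)


def masks_from_parts(parts, num_patches):
    """Return boolean masks of shape (k, num_patches); True marks a masked patch."""
    masks = np.ones((len(parts), num_patches), dtype=bool)
    for i, visible in enumerate(parts):
        masks[i, visible] = False  # patches visible to part i are not masked
    return masks


def consistency_loss(preds, masks):
    """Mean squared disagreement on patches masked in both parts of each pair.

    preds has shape (k, num_patches, dim). The squared difference is an
    assumption made for illustration only.
    """
    k = preds.shape[0]
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            shared = masks[i] & masks[j]  # masked in part i and in part j
            if shared.any():
                diff = preds[i, shared] - preds[j, shared]
                total += float(np.mean(diff ** 2))
                pairs += 1
    return total / pairs if pairs else 0.0


rng = np.random.default_rng(0)
num_patches, k, dim = 196, 4, 8  # 14 x 14 patches, 4 parts, toy feature size
parts = partition_patches(num_patches, k, rng)
masks = masks_from_parts(parts, num_patches)
preds = rng.normal(size=(k, num_patches, dim))  # stand-in for decoder outputs
loss = consistency_loss(preds, masks)
```

Under these assumptions, every patch is visible in exactly one part and masked in the other K - 1, so one iteration uses the whole image while each part still sees a high mask ratio (75% for K = 4).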