Masked image modeling (MIM) has become a leading self-supervised learning strategy. MIM methods such as the Masked Autoencoder (MAE) learn strong representations by randomly masking the input tokens that the encoder processes, while a decoder reconstructs the masked tokens of the input. However, MIM pre-trained encoders often exhibit a limited attention span, which is attributed to MIM's sole focus on regressing masked tokens and may impede the encoder's learning of broader context. To address this limitation, we improve MIM by explicitly incorporating unmasked tokens into the training process. Specifically, our method lets the encoder learn from broader context supervision, so that unmasked tokens are exposed to broader contexts while the decoder reconstructs the masked tokens. The encoded unmasked tokens therefore carry richer contextual information, which the masked tokens can leverage during reconstruction. This simple remedy yields more discriminative representations, as shown by 84.2% top-1 accuracy with ViT-B on ImageNet-1K, a gain of 0.6 percentage points. We attribute the success to the enhanced pre-training method, as evidenced by singular value spectrum and attention analyses. Finally, our models achieve significant performance gains on downstream semantic segmentation and fine-grained visual classification tasks, as well as on diverse robustness evaluation metrics. Code is available at https://github.com/naver-ai/lut.
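To make the random masking step concrete, the following is a minimal sketch of how token indices might be split into unmasked (visible) and masked sets. It is not the authors' implementation: the function name `random_mask`, the 75% mask ratio, and the 196-token count (a 224x224 image cut into 16x16 patches) are assumptions chosen for illustration and are not stated in the abstract.

```python
import random


def random_mask(num_tokens, mask_ratio=0.75, seed=None):
    """Split token indices into visible (unmasked) and masked sets.

    Hypothetical helper for illustration; mask_ratio=0.75 is an assumed default.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_tokens))
    rng.shuffle(indices)  # uniform random permutation of token positions
    num_masked = int(num_tokens * mask_ratio)
    masked = sorted(indices[:num_masked])    # targets the decoder reconstructs
    visible = sorted(indices[num_masked:])   # unmasked tokens
    return visible, masked


# Assumed setup: 14 * 14 = 196 patch tokens per image.
visible, masked = random_mask(196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

In a standard MIM objective, only the positions in `masked` contribute to the reconstruction loss; the abstract's proposal is to also give the tokens in `visible` an explicit training signal.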