Masked Image Modeling (MIM) has emerged as a popular method for Self-Supervised Learning (SSL) of visual representations. However, on high-level perception tasks, MIM-pretrained models offer lower out-of-the-box representation quality than Joint-Embedding Architectures (JEA), another prominent SSL paradigm. To understand this performance gap, we analyze the information flow in Vision Transformers (ViT) trained with both approaches. We reveal that whereas JEAs construct their representation from a selected set of relevant image fragments, MIM models aggregate nearly the entire image content. Moreover, we demonstrate that MIM-trained ViTs retain valuable information in their patch tokens that is not effectively captured by the global [cls] token representation. Consequently, selectively aggregating relevant patch tokens, without any fine-tuning, yields consistently higher-quality MIM representations. To our knowledge, we are the first to highlight the lack of effective representation aggregation as an emergent issue of MIM, and we propose directions to address it, contributing to future advances in Self-Supervised Learning.
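To make the aggregation idea concrete, below is a minimal sketch of selective patch-token pooling over a frozen MIM-pretrained ViT. The function name `selective_aggregate`, the use of [cls] attention as the relevance score, and the `top_k` cutoff are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def selective_aggregate(patch_tokens: torch.Tensor,
                        cls_attn: torch.Tensor,
                        top_k: int = 32) -> torch.Tensor:
    """Average-pool the top-k patch tokens ranked by a relevance score.

    patch_tokens: (B, N, D) last-layer patch embeddings from a frozen ViT.
    cls_attn:     (B, N) relevance of each patch, e.g. the [cls] query's
                  attention over patches, averaged across heads (an
                  illustrative choice of relevance criterion).
    """
    # Indices of the top_k most relevant patches per image.
    idx = cls_attn.topk(top_k, dim=-1).indices              # (B, top_k)
    # Expand indices so we can gather full D-dimensional embeddings.
    idx = idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1))
    selected = patch_tokens.gather(1, idx)                  # (B, top_k, D)
    # Mean of the selected tokens stands in for the [cls] representation.
    return selected.mean(dim=1)                             # (B, D)


# Toy usage with random stand-ins for a frozen backbone's outputs.
if __name__ == "__main__":
    tokens = torch.randn(2, 196, 768)   # 14x14 patches, ViT-B width
    attn = torch.rand(2, 196)
    pooled = selective_aggregate(tokens, attn, top_k=32)
    print(pooled.shape)                 # torch.Size([2, 768])
```

Under these assumptions, the pooled vector can replace the [cls] embedding as the input to a linear probe, with no updates to the backbone.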