Masked Image Modeling (MIM) has emerged as a promising method for learning visual representations from unlabeled images by predicting the missing pixels of masked image regions. It excels at region-aware learning and provides strong initializations for a variety of downstream tasks, but it struggles to capture high-level semantics without further supervised fine-tuning, likely because its pixel-reconstruction objective is inherently low-level. A promising yet unrealized alternative is to learn representations through masked reconstruction in latent space, combining the locality of MIM with high-level reconstruction targets. This approach, however, poses significant training challenges: because the reconstruction targets are learned jointly with the model, training can converge to trivial or suboptimal solutions. Our study is among the first to thoroughly analyze and address the challenges of such a framework, which we refer to as Latent MIM. Through a series of carefully designed experiments and extensive analysis, we identify the sources of these challenges, including representation collapse under joint online/target optimization, the choice of learning objective, high inter-region correlation in latent space, and decoder conditioning. By addressing these issues sequentially, we demonstrate that Latent MIM can indeed learn high-level representations while retaining the benefits of MIM models.
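To make the latent-reconstruction setup concrete, the following is a minimal, illustrative NumPy sketch, not the authors' actual method: a toy online encoder produces latents for visible patches, a trivial predictor guesses the latents of masked patches, the loss is computed against a slowly updated target encoder's outputs, and an exponential moving average (EMA) update of the target weights stands in for one common mechanism used to stabilize joint online/target optimization. All names, shapes, and the EMA momentum value are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 16 patches with 8-dim features (sizes are illustrative).
num_patches, dim = 16, 8
patches = rng.normal(size=(num_patches, dim))   # stand-in for patch features
W_online = rng.normal(size=(dim, dim)) * 0.1    # online encoder (a single linear map)
W_target = W_online.copy()                      # target encoder, an EMA copy

# Mask ~75% of patches, ensuring both sets are non-empty.
mask = rng.random(num_patches) < 0.75
mask[0], mask[1] = False, True

# Targets: latent representations of masked patches from the (frozen) target encoder.
targets = patches[mask] @ W_target

# Online branch sees only visible patches; a deliberately trivial predictor
# (mean of visible latents) guesses every masked latent.
visible_latents = patches[~mask] @ W_online
pred = np.tile(visible_latents.mean(axis=0), (int(mask.sum()), 1))

# Reconstruction loss is computed in latent space, not pixel space.
loss = np.mean((pred - targets) ** 2)

# EMA update: the target encoder trails the online encoder, one mechanism
# used to discourage representation collapse when targets are learned jointly.
momentum = 0.99
W_target = momentum * W_target + (1 - momentum) * W_online
```

Since the targets move with the model, a degenerate solution (e.g., both encoders mapping everything to a constant) drives this loss to zero, which is precisely the collapse risk the abstract describes.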