Masked image modeling (MIM) is a promising option for training Vision Transformers among various self-supervised learning (SSL) methods. The essence of MIM lies in token-wise masked token predictions, with targets tokenized from images or generated by pre-trained models such as vision-language models. While tokenizers or pre-trained models are plausible MIM targets, they often yield spatially inconsistent targets even for neighboring tokens, making it difficult for models to learn unified discriminative representations. Our pilot study confirms that addressing spatial inconsistencies has the potential to enhance representation quality. Motivated by these findings, we introduce a novel self-supervision signal called Dynamic Token Morphing (DTM), which dynamically aggregates contextually related tokens to yield contextualized targets. DTM is compatible with various SSL frameworks; we showcase an improved MIM by employing DTM while introducing only marginal extra training cost. Our experiments on ImageNet-1K and ADE20K demonstrate the superiority of our method over state-of-the-art, complex MIM methods. Furthermore, comparative evaluations on the iNaturalist and fine-grained visual classification datasets further validate the transferability of our method across various downstream tasks. Code is available at https://github.com/naver-ai/dtm
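The core idea of aggregating contextually related tokens into contextualized targets can be illustrated with a minimal sketch. The cosine-similarity grouping, the `threshold` parameter, and the simple mean-pooling rule below are assumptions for illustration, not the paper's exact DTM procedure:

```python
import numpy as np

def morph_tokens(targets, threshold=0.8):
    """Illustrative sketch: replace each target token with the mean of
    its contextually similar tokens (cosine similarity > threshold).
    The grouping rule and threshold are assumptions, not the paper's
    exact DTM algorithm."""
    # l2-normalize tokens so the inner product is cosine similarity
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    sim = t @ t.T                # (N, N) pairwise cosine similarities
    mask = sim > threshold       # tokens deemed contextually related
    # contextualized target = mean over each token's related group
    # (the diagonal is always 1.0, so every group is non-empty)
    return (mask @ targets) / mask.sum(axis=1, keepdims=True)

# e.g. 14x14 = 196 ViT patch tokens with 768-dim target features
tokens = np.random.randn(196, 768).astype(np.float32)
morphed = morph_tokens(tokens)
print(morphed.shape)  # (196, 768)
```

Averaging spatially or semantically related targets smooths out token-wise inconsistencies, which is the property the abstract attributes to DTM's contextualized targets.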