In the realm of self-supervised learning (SSL), masked image modeling (MIM) has gained popularity alongside contrastive learning methods. MIM involves reconstructing masked regions of input images from their unmasked portions. A notable subset of MIM methods employs discrete tokens as the reconstruction target, but the theoretical underpinnings of this choice remain underexplored. In this paper, we examine the role of these discrete tokens, aiming to unravel their benefits and limitations. Building upon the connection between MIM and contrastive learning, we provide a comprehensive theoretical understanding of how discrete tokenization affects the model's generalization capabilities. Furthermore, we propose a novel metric named TCAS, specifically designed to assess the effectiveness of discrete tokens within the MIM framework. Guided by this metric, we contribute an innovative tokenizer design and propose a corresponding MIM method named ClusterMIM. It demonstrates superior performance across a variety of benchmark datasets and ViT backbones. Code is available at https://github.com/PKU-ML/ClusterMIM.