Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding

Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data. In computer vision, pretrained vision transformers (ViTs) have played a pivotal role in advancing transfer learning. However, the cost of finetuning these models has escalated as model sizes have grown rapidly. This study evaluates the effectiveness of pure self-supervised learning (SSL) techniques in computer vision tasks without any finetuning, with the aim of emulating human-like capabilities in generalisation and recognition of unseen objects. To this end, we propose an evaluation protocol for zero-shot segmentation based on a prompting patch. Given a point on the target object as a prompt, the algorithm calculates the similarity map between the selected patch and all other patches; a simple threshold is then applied to this map to segment the target. A second evaluation measures intra-object and inter-object similarity to gauge the discriminative ability of SSP ViTs. Insights from prompt-based zero-shot segmentation and from the discriminative ability of SSP ViTs led to the design of a simple SSP approach, termed MMC. This approach combines Masked image modelling to encourage similarity of local features, Momentum-based self-distillation to transfer semantics from global to local features, and global Contrast to promote the semantics of global features, in order to enhance the discriminative representations of SSP ViTs. As a result, our proposed method significantly reduces the overlap between intra-object and inter-object similarities, thereby facilitating effective object segmentation within an image. Our experiments show that MMC delivers top-tier results in zero-shot semantic segmentation across various datasets.
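The prompt-based evaluation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure or the threshold, so cosine similarity and a threshold of 0.5 are assumptions here, and the function name `prompt_segment`, its parameters, and the toy features are hypothetical. The patch features are assumed to come from a frozen SSP ViT, one embedding per patch.

```python
import numpy as np


def prompt_segment(patch_features, grid_hw, point_xy, patch_size=16, threshold=0.5):
    """Threshold the similarity between a prompted patch and all patches.

    patch_features: (H*W, D) array, one embedding per patch (row-major grid).
    grid_hw:        (H, W), number of patches along each image axis.
    point_xy:       (x, y) pixel coordinates of the prompt point.
    patch_size:     side length of a square patch in pixels (assumed value).
    threshold:      similarity cut-off (assumed value, not from the paper).
    """
    h, w = grid_hw
    x, y = point_xy
    # Locate the patch that contains the prompt point.
    row, col = y // patch_size, x // patch_size
    prompt_idx = row * w + col

    # L2-normalise so that a dot product equals cosine similarity (assumption).
    norms = np.linalg.norm(patch_features, axis=1, keepdims=True)
    feats = patch_features / np.clip(norms, 1e-12, None)

    # Similarity of every patch to the prompted patch, laid out on the grid.
    similarity_map = (feats @ feats[prompt_idx]).reshape(h, w)
    # Simple thresholding yields the patch-level segmentation mask.
    mask = similarity_map >= threshold
    return similarity_map, mask


# Toy 2x2 grid: the left column mimics the object, the right column the background.
toy_features = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])
similarity_map, mask = prompt_segment(toy_features, (2, 2), point_xy=(3, 3))
print(mask)
```

In this toy case the prompt falls in the top-left patch, and only the two left-column patches exceed the threshold. With real features, the mask is at patch resolution and would need to be upsampled to pixel resolution before comparison with a ground-truth segmentation.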
