Self-supervised representation learning methods mainly focus on image-level instance discrimination. This study explores whether adding patch-level discrimination to existing methods improves the quality of the learned representations by attending to local and global visual features simultaneously. To this end, we present a simple yet effective patch-matching algorithm that finds corresponding patches across the augmented views of an image. The augmented views are then fed into a self-supervised learning framework that uses a Vision Transformer (ViT) as its backbone, which produces both image-level and patch-level representations. Using the proposed patch-matching algorithm, the model minimizes the representation distance not only between the CLS tokens but also between corresponding patches. As a result, the model gains a more comprehensive understanding of both the image as a whole and its finer details. We pretrain the proposed method on small-, medium-, and large-scale datasets. The results show that our approach can outperform state-of-the-art image-level representation learning methods on both image classification and downstream tasks.

Keywords: Self-Supervised Learning; Visual Representations; Local-Global Representation Learning; Patch-Wise Representation Learning; Vision Transformer (ViT)
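The abstract does not spell out the matching rule or the exact loss, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes each augmented view is an axis-aligned crop of the original image with no flipping, described by a hypothetical `(x0, y0, x1, y1)` box, and that a patch in one view corresponds to the patch in the other view that covers its center. The cosine distance, the `weight` parameter, and all function names are illustrative choices.

```python
import numpy as np


def patch_centers(box, grid):
    """Centers of a grid x grid patch layout, in original-image coordinates.

    box is (x0, y0, x1, y1), the crop that produced the view (assumed format).
    Returns an array of shape (grid * grid, 2) in row-major patch order.
    """
    x0, y0, x1, y1 = box
    xs = x0 + (np.arange(grid) + 0.5) * (x1 - x0) / grid
    ys = y0 + (np.arange(grid) + 0.5) * (y1 - y0) / grid
    gx, gy = np.meshgrid(xs, ys)  # rows follow y, columns follow x
    return np.stack([gx.ravel(), gy.ravel()], axis=1)


def match_patches(box_a, box_b, grid):
    """Pair each patch of view A with the patch of view B covering its center.

    Returns a list of (index_a, index_b). Patches of A whose center falls
    outside crop B stay unmatched (an assumed rule, for illustration only).
    """
    x0, y0, x1, y1 = box_b
    pairs = []
    for i, (cx, cy) in enumerate(patch_centers(box_a, grid)):
        if not (x0 <= cx < x1 and y0 <= cy < y1):
            continue
        col = min(int((cx - x0) / (x1 - x0) * grid), grid - 1)
        row = min(int((cy - y0) / (y1 - y0) * grid), grid - 1)
        pairs.append((i, row * grid + col))
    return pairs


def cosine_distance(u, v):
    """1 - cosine similarity along the last axis."""
    u = u / np.linalg.norm(u, axis=-1, keepdims=True)
    v = v / np.linalg.norm(v, axis=-1, keepdims=True)
    return 1.0 - np.sum(u * v, axis=-1)


def local_global_loss(cls_a, cls_b, patches_a, patches_b, pairs, weight=1.0):
    """CLS-token distance plus the mean distance over matched patch pairs."""
    loss = cosine_distance(cls_a, cls_b)
    if pairs:
        idx_a, idx_b = map(list, zip(*pairs))
        local = cosine_distance(patches_a[idx_a], patches_b[idx_b]).mean()
        loss = loss + weight * local
    return float(loss)
```

For example, with `grid=2`, the crops `(0, 0, 4, 4)` and `(2, 2, 6, 6)` overlap only in the bottom-right patch of the first view, which is paired with the top-left patch of the second. In a real pipeline, `cls_*` and `patches_*` would be the ViT outputs for the two views, and the matching would also need to account for flips and other geometric augmentations.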