Existing contrastive learning methods widely adopt one-hot instance discrimination as the pretext task for self-supervised learning. This inevitably neglects the rich inter-instance similarities among natural images and can lead to representation degeneration. In this paper, we propose PatchMix, a novel image mixing method for contrastive learning with Vision Transformers (ViT), to model inter-instance similarities among images. Following the patch-based nature of ViT, we randomly mix multiple images from a mini-batch at the patch level to construct mixed image patch sequences for ViT. Compared to existing sample mixing methods, PatchMix can flexibly and efficiently mix more than two images and simulate more complicated similarity relations among natural images. In this manner, our contrastive framework can significantly reduce the gap between the contrastive objective and the real-world ground truth. Experimental results demonstrate that our method significantly outperforms the previous state of the art on both the ImageNet-1K and CIFAR datasets, e.g., a 3.0% linear accuracy improvement on ImageNet-1K and an 8.7% kNN accuracy improvement on CIFAR100. Moreover, our method achieves leading transfer performance on downstream tasks, namely object detection and instance segmentation on the COCO dataset. The code is available at https://github.com/visresearch/patchmix.
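To make the patch-level mixing idea concrete, the following is a minimal sketch in plain Python, not the implementation from the repository above. It rests on several assumptions that the abstract does not state: each image is already split into a sequence of N patches, each mixed sequence draws from `num_mix` consecutive images in the mini-batch (wrapping around), patch positions are preserved, and the soft target for each mixed sequence is the fraction of patches taken from each source image. The function name `patch_mix` and its parameters are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def patch_mix(
    batch: Sequence[Sequence[object]],
    num_mix: int,
    rng: random.Random,
) -> Tuple[List[List[object]], List[List[float]]]:
    """Mix patches of `num_mix` images per output sequence (illustrative only).

    batch:   B patch sequences, each holding N patches (any objects).
    num_mix: number of source images per mixed sequence (assumed 1 <= num_mix <= B).
    Returns the mixed sequences and, for each, a length-B soft target giving
    the fraction of patches drawn from every image in the mini-batch.
    """
    num_images = len(batch)
    num_patches = len(batch[0])
    if not 1 <= num_mix <= num_images:
        raise ValueError("num_mix must lie between 1 and the batch size")

    mixed: List[List[object]] = []
    targets: List[List[float]] = []
    for i in range(num_images):
        # Assumption: sources are image i and its next num_mix - 1 neighbours.
        sources = [(i + k) % num_images for k in range(num_mix)]
        # Give each source a near-equal share of positions, then shuffle
        # so that the assignment of positions to sources is random.
        owners = [sources[p % num_mix] for p in range(num_patches)]
        rng.shuffle(owners)
        # Position p of the mixed sequence takes patch p of its owner image.
        mixed.append([batch[src][p] for p, src in enumerate(owners)])
        # Soft similarity target: share of patches contributed by each image.
        weights = [0.0] * num_images
        for src in owners:
            weights[src] += 1.0 / num_patches
        targets.append(weights)
    return mixed, targets
```

With a mini-batch of 4 images, 8 patches each, and `num_mix=2`, every mixed sequence takes 4 patches from each of two images, so its soft target is 0.5 for both sources and 0 elsewhere. A contrastive loss could use such targets in place of one-hot labels, which is the sense in which mixing models inter-instance similarity.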