In computer vision, self-supervised contrastive learning enforces similar representations between different views of the same image. Pre-training is most often performed on image classification datasets, such as ImageNet, where images mainly contain a single class of objects. However, in complex scenes containing multiple objects, it becomes very unlikely that several views of the same image represent the same object category. For this setting, we propose SAMCLR, an add-on to SimCLR that uses SAM to segment the image into semantic regions and then samples the two views from the same region. Preliminary results show empirically that, when pre-training on Cityscapes and ADE20K and then evaluating on classification on CIFAR-10, STL10 and ImageNette, SAMCLR performs at least on par with, and most often significantly outperforms, not only SimCLR but also DINO and MoCo.
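The region-constrained view sampling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sample_region_views`, the integer `region_map` (assumed to have been computed beforehand by a segmenter such as SAM), and the choice to centre each crop on a random pixel of the selected region are all assumptions made for illustration. The abstract does not specify how regions or crops are drawn, and the usual SimCLR augmentations and contrastive loss are omitted.

```python
import numpy as np


def sample_region_views(image, region_map, crop_size, rng):
    """Draw two square crops anchored on the same semantic region.

    image:      (H, W, C) array.
    region_map: (H, W) integer array, one id per region (assumed precomputed,
                e.g. by SAM; this sketch does not call any segmentation model).
    crop_size:  side length of each crop, in pixels.
    rng:        numpy.random.Generator.
    """
    h, w = region_map.shape
    if crop_size > min(h, w):
        raise ValueError("crop_size must not exceed the image size")

    # Assumption: pick one region uniformly at random among the region ids.
    region_id = rng.choice(np.unique(region_map))
    ys, xs = np.nonzero(region_map == region_id)

    half = crop_size // 2
    boxes, views = [], []
    for _ in range(2):
        # Centre the crop on a random pixel of the chosen region, then
        # shift it if needed so that it stays inside the image.
        k = rng.integers(len(ys))
        top = int(np.clip(ys[k] - half, 0, h - crop_size))
        left = int(np.clip(xs[k] - half, 0, w - crop_size))
        boxes.append((top, left))
        views.append(image[top:top + crop_size, left:left + crop_size])
    return region_id, boxes, views


# Toy usage: a blank image whose left and right halves are two "regions".
rng = np.random.default_rng(0)
image = np.zeros((64, 64, 3), dtype=np.uint8)
region_map = np.zeros((64, 64), dtype=np.int64)
region_map[:, 32:] = 1
region_id, boxes, views = sample_region_views(image, region_map, 16, rng)
```

Because each crop is centred on a pixel of the selected region before being shifted into the image bounds, both views are guaranteed to overlap that region. In a full pipeline the two crops would then go through the standard augmentation and contrastive objective in place of two unconstrained random crops.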