In computer vision, self-supervised contrastive learning enforces similar representations between different views of the same image. Pre-training is most often performed on image classification datasets, such as ImageNet, where images mainly contain a single class of objects. However, in complex scenes containing multiple items, several views of the same image are very unlikely to represent the same object category. For this setting, we propose SAMCLR, an add-on to SimCLR that uses SAM to segment the image into semantic regions and then samples the two views from the same region. Preliminary empirical results show that, when pre-training on Cityscapes and ADE20K and then evaluating classification on CIFAR-10, STL10 and ImageNette, SAMCLR performs at least on par with, and most often significantly outperforms, not only SimCLR but also DINO and MoCo.
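The core idea of sampling both views from one semantic region can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the segmentation has already been computed (for example by SAM) and is supplied as an integer label map, and the function name `sample_views_from_region`, the uniform choice of region, and the clamped square crops are all hypothetical choices made for illustration.

```python
import numpy as np

def sample_views_from_region(image, region_map, crop_size, rng):
    """Return (region_id, [view_a, view_b]) with both crops anchored in one region.

    image:      H x W x C array.
    region_map: H x W integer array of region labels (assumed precomputed).
    crop_size:  side length of the square crops (assumed <= min(H, W)).
    """
    h, w = region_map.shape
    assert image.shape[:2] == (h, w), "image and region_map must align"
    assert crop_size <= min(h, w), "crop must fit inside the image"
    half = crop_size // 2

    # Assumption: pick one region uniformly at random among the labels present.
    region_id = rng.choice(np.unique(region_map))
    ys, xs = np.nonzero(region_map == region_id)

    views = []
    for _ in range(2):
        # Draw a pixel of the chosen region as the nominal crop centre.
        i = rng.integers(len(ys))
        # Clamp the centre so the crop stays inside the image; near the
        # border the crop is therefore only approximately centred on the pixel.
        cy = int(np.clip(ys[i], half, h - (crop_size - half)))
        cx = int(np.clip(xs[i], half, w - (crop_size - half)))
        top, left = cy - half, cx - half
        views.append(image[top:top + crop_size, left:left + crop_size])
    return region_id, views

# Toy usage: an 8x8 image split into a left region (0) and a right region (1).
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
region_map = np.zeros((8, 8), dtype=int)
region_map[:, 4:] = 1
region_id, (view_a, view_b) = sample_views_from_region(image, region_map, 4, rng)
```

In a SimCLR-style pipeline, the two returned crops would then pass through the usual augmentations and the contrastive loss; only the view-sampling step is sketched here.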