Self-supervised learning holds promise for leveraging large amounts of unlabeled data. However, its success relies heavily on highly curated datasets such as ImageNet, which still require human cleaning. Learning representations directly from less-curated scene images is essential for pushing self-supervised learning further. Unlike curated images, which carry simple and clear semantic information, scene images are more complex and mosaic-like because they often contain complex scenes and multiple objects. Although learning from such images is feasible, recent works have largely overlooked discovering the most discriminative regions for contrastive learning of object representations in scene images. In this work, we leverage the saliency map derived from the model's output during learning to highlight these discriminative regions and guide the whole contrastive learning process. Specifically, the saliency map first guides the method to crop discriminative regions as positive pairs and then reweights the contrastive losses among different crops by their saliency scores. Our method significantly improves self-supervised learning on scene images, with Top-1 accuracy gains of +1.1 in ImageNet linear evaluation and +4.3 and +2.2 in semi-supervised learning with 1% and 10% of ImageNet labels, respectively. We hope our insights on saliency maps motivate future research on more general-purpose unsupervised representation learning from scene data.
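The two steps named in the abstract, saliency-guided cropping and saliency-based loss reweighting, can be illustrated with a minimal sketch. The abstract does not specify how crops are sampled, which contrastive loss is used, or how the weights are normalized, so everything below is an assumption made for illustration: crops are chosen by exhaustive search over fixed-size windows, the per-crop loss is a standard InfoNCE loss, and the weights are saliency scores normalized to sum to one. The function names `top_salient_crops`, `info_nce`, and `weighted_contrastive_loss` are hypothetical and do not come from the paper.

```python
import math


def top_salient_crops(saliency, size, k, stride=1):
    """Return the k square crops with the highest mean saliency.

    Assumption: exhaustive sliding-window search; the paper's actual
    crop sampling strategy is not described in the abstract.
    Each result is a tuple (mean_saliency, top, left).
    """
    height, width = len(saliency), len(saliency[0])
    candidates = []
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            total = sum(
                saliency[r][c]
                for r in range(top, top + size)
                for c in range(left, left + size)
            )
            candidates.append((total / (size * size), top, left))
    candidates.sort(reverse=True)  # highest mean saliency first
    return candidates[:k]


def info_nce(query, keys, positive_index, temperature=0.2):
    """Standard InfoNCE loss for one query (assumed loss, not from the source)."""
    logits = [
        sum(q * x for q, x in zip(query, key)) / temperature for key in keys
    ]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[positive_index]


def weighted_contrastive_loss(losses, scores):
    """Average per-crop losses, weighted by normalized saliency scores."""
    total = sum(scores)
    return sum(l * s for l, s in zip(losses, scores)) / total


# Toy usage: a 4x4 saliency map whose bottom-right corner is salient.
saliency_map = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
crops = top_salient_crops(saliency_map, size=2, k=2)
scores = [score for score, _, _ in crops]
# Placeholder per-crop losses; in practice these would come from info_nce
# applied to the embeddings of each crop pair.
loss = weighted_contrastive_loss([1.0, 3.0], scores)
```

In this sketch, crops that cover more salient pixels contribute more to the final loss, which is one plausible reading of "reweights the contrastive losses among different crops by their saliency scores".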