Self-supervised learning holds promise for leveraging large amounts of unlabeled data. However, its success relies heavily on highly curated datasets such as ImageNet, which still require human cleaning. Learning representations directly from less-curated scene images is essential for pushing self-supervised learning to a higher level. Unlike curated images, which carry simple and clear semantic information, scene images are more complex and mosaic-like because they often contain intricate scenes and multiple objects. Although learning from such images is feasible, recent works have largely overlooked the discovery of the most discriminative regions for contrastive learning of object representations in scene images. In this work, we leverage the saliency map derived from the model's output during learning to highlight these discriminative regions and guide the whole contrastive learning process. Specifically, the saliency map first guides the method to crop discriminative regions as positive pairs, and the contrastive losses among different crops are then reweighted by their saliency scores. Our method significantly improves the performance of self-supervised learning on scene images, by +1.1, +4.3, and +2.2 Top-1 accuracy on ImageNet linear evaluation and on semi-supervised learning with 1% and 10% of ImageNet labels, respectively. We hope our insights on saliency maps can motivate future research on more general-purpose unsupervised representation learning from scene data.
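The two steps named in the abstract, saliency-guided cropping and saliency-based loss reweighting, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, and drawing crop centers in proportion to saliency, using an InfoNCE loss with temperature 0.2, and weighting each pair by the product of its two crops' mean saliency are all assumptions made only for illustration.

```python
import numpy as np

def sample_salient_crops(saliency, crop_size, num_crops, rng):
    """Sample square crops whose centers are drawn in proportion to saliency.

    Assumption: the saliency map is non-negative and crop_size fits in the image.
    Returns (top, left, height, width) boxes and each box's mean saliency.
    """
    h, w = saliency.shape
    probs = saliency.ravel() / saliency.sum()
    centers = rng.choice(h * w, size=num_crops, p=probs)
    boxes, scores = [], []
    for c in centers:
        cy, cx = divmod(int(c), w)
        # Clamp the box so that it stays inside the image.
        top = min(max(cy - crop_size // 2, 0), h - crop_size)
        left = min(max(cx - crop_size // 2, 0), w - crop_size)
        boxes.append((top, left, crop_size, crop_size))
        patch = saliency[top:top + crop_size, left:left + crop_size]
        scores.append(float(patch.mean()))
    return boxes, np.array(scores)

def info_nce_per_pair(z_a, z_b, temperature=0.2):
    """InfoNCE loss for each positive pair (z_a[i], z_b[i]).

    The other rows of z_b serve as negatives for row i of z_a.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_prob)

def saliency_weighted_loss(z_a, z_b, scores_a, scores_b, temperature=0.2):
    """Weight each pair's loss by the normalized product of its saliency scores.

    Assumption: the product of scores is one plausible weighting, not the paper's.
    """
    weights = scores_a * scores_b
    weights = weights / weights.sum()
    return float((weights * info_nce_per_pair(z_a, z_b, temperature)).sum())

# Toy usage: one salient block in a 32x32 map, random stand-in embeddings.
rng = np.random.default_rng(0)
saliency = np.full((32, 32), 1e-3)
saliency[8:16, 8:16] = 1.0
boxes, scores = sample_salient_crops(saliency, crop_size=8, num_crops=4, rng=rng)

z = rng.normal(size=(4, 16))  # embeddings of 4 crops (placeholder for an encoder)
aligned = saliency_weighted_loss(z, z, scores, scores)
shuffled = saliency_weighted_loss(z, np.roll(z, 1, axis=0), scores, scores)
```

In this sketch, pairs whose crops cover more salient regions contribute more to the total loss, which mirrors the reweighting idea at a high level; a real pipeline would compute the saliency map from the network itself and feed image crops through an encoder rather than using random vectors.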