Recent advances in self-supervised visual representation learning have paved the way for unsupervised methods tackling tasks such as object discovery and instance segmentation. However, discovering objects in an image without any supervision is a very hard task: which objects are of interest, when should they be split into parts, how many are there, and to which classes do they belong? The answers to these questions depend on the evaluation task and dataset. In this work, we take a different approach and propose to look for the background instead. This way, the salient objects emerge as a by-product, without any strong assumption on what an object should be. We propose FOUND, a simple model made of a single $1\times1$ convolution initialized with coarse background masks extracted from self-supervised patch-based representations. After fast training and refinement of these seed masks, the model reaches state-of-the-art results on unsupervised saliency detection and object discovery benchmarks. Moreover, we show that our approach yields good results in the unsupervised semantic segmentation retrieval task. The code to reproduce our results is available at https://github.com/valeoai/FOUND.
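To make the idea of a single $1\times1$ convolution over patch features concrete, here is a minimal pure-Python sketch. It is not the FOUND implementation. It relies only on the fact that a $1\times1$ convolution applies the same linear map to every spatial position, so each patch receives a background logit $w \cdot f + b$ independently of its neighbours. The toy features, the binary cross-entropy loss, the learning rate, and the helper names `conv1x1` and `train_step` are all assumptions made for illustration. In particular, the seed mask is written by hand rather than extracted from self-supervised representations, and the weights start at zero.

```python
import math

def conv1x1(features, weights, bias):
    """Single-output 1x1 convolution over an H x W grid of D-dim patch features.

    The same linear map is shared by all positions, so each patch gets the
    logit w . f + b on its own.
    """
    return [[sum(w * x for w, x in zip(weights, f)) + bias for f in row]
            for row in features]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(features, seed_mask, weights, bias, lr=0.5):
    """One gradient-descent step of binary cross-entropy against a coarse mask.

    seed_mask[i][j] is 1 for patches marked as background, 0 otherwise.
    The loss and optimizer are assumptions of this sketch.
    """
    logits = conv1x1(features, weights, bias)
    n = sum(len(row) for row in features)
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for f_row, z_row, m_row in zip(features, logits, seed_mask):
        for f, z, m in zip(f_row, z_row, m_row):
            err = sigmoid(z) - m  # derivative of BCE w.r.t. the logit
            grad_b += err / n
            for k, x in enumerate(f):
                grad_w[k] += err * x / n
    new_weights = [w - lr * g for w, g in zip(weights, grad_w)]
    return new_weights, bias - lr * grad_b

# Toy 2 x 3 grid of 2-D patch features (hypothetical values).
features = [[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
            [[1.0, 0.1], [0.1, 0.9], [0.0, 1.0]]]
# Hand-written coarse background mask (1 = background).
seed_mask = [[1, 1, 0],
             [1, 0, 0]]

weights, bias = [0.0, 0.0], 0.0
for _ in range(200):
    weights, bias = train_step(features, seed_mask, weights, bias)

# Predicted background; the foreground is simply its complement.
background = [[sigmoid(z) > 0.5 for z in row]
              for row in conv1x1(features, weights, bias)]
foreground = [[not b for b in row] for row in background]
```

The last two lines mirror the framing of the abstract: the model only ever predicts background, and the salient region is read off as whatever is left over.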