Learning object segmentation in image and video datasets without human supervision is a challenging problem. Humans easily identify salient moving objects in videos using the gestalt principle of common fate, which suggests that what moves together belongs together. Building on this idea, we propose a self-supervised object discovery approach that leverages motion and appearance information to produce high-quality object segmentation masks. Specifically, we redesign the traditional graph cut on images so that edge weights are computed from a linear combination of motion and appearance information. Remarkably, this step alone produces object segmentation masks comparable to those of current state-of-the-art methods on multiple benchmarks. To further improve performance, we train a segmentation network on these preliminary masks as pseudo-ground truth and then bootstrap it via self-training, so that it learns from its own outputs. We demonstrate the effectiveness of our approach, named LOCATE, on multiple standard video object segmentation, image saliency detection, and object segmentation benchmarks, achieving results on par with, and in many cases surpassing, state-of-the-art methods. We also demonstrate the transferability of our approach to novel domains through a qualitative study on in-the-wild images. Additionally, we present an extensive ablation analysis to support our design choices and highlight the contribution of each component of the proposed method.
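
The edge-weight construction described above can be illustrated with a minimal sketch. It assumes cosine-similarity affinities between per-node feature vectors and a normalized-cut-style spectral bipartition; the function names, the mixing weight `alpha`, and the toy features are hypothetical choices for illustration and are not taken from the LOCATE implementation.

```python
import numpy as np


def cosine_affinity(feats):
    """Pairwise cosine similarity between row vectors, clipped to [0, 1]."""
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return np.clip(unit @ unit.T, 0.0, 1.0)


def combined_edge_weights(app_feats, motion_feats, alpha=0.5):
    """Linear combination of appearance and motion affinities.

    alpha is a hypothetical mixing weight (assumption, not from the source).
    """
    return (alpha * cosine_affinity(app_feats)
            + (1.0 - alpha) * cosine_affinity(motion_feats))


def spectral_bipartition(weights):
    """Split graph nodes in two using the second-smallest eigenvector
    of the symmetrically normalized graph Laplacian."""
    degree = weights.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree + 1e-8)
    laplacian = np.diag(degree) - weights
    sym = d_inv_sqrt[:, None] * laplacian * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(sym)  # eigenvalues in ascending order
    fiedler = d_inv_sqrt * eigvecs[:, 1]
    # Threshold at the mean; which side is "object" is not decided here.
    return fiedler > fiedler.mean()


# Toy example: six nodes (e.g. image patches) forming two groups.
app = np.array([[1.0, 0.1]] * 3 + [[0.1, 1.0]] * 3)   # appearance features
flow = np.array([[1.0, 0.2]] * 3 + [[0.2, 1.0]] * 3)  # motion features
mask = spectral_bipartition(combined_edge_weights(app, flow, alpha=0.5))
```

In this sketch, nodes that look alike and move alike receive high edge weights and end up on the same side of the cut. Deciding which side is foreground, and refining the mask, would require further steps that the sketch does not cover.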