We study learning object segmentation from unlabeled videos. Humans can easily segment moving objects without knowing what they are. The Gestalt law of common fate, i.e., what moves at the same speed belongs together, has inspired unsupervised object discovery based on motion segmentation. However, common fate is not a reliable indicator of objectness: parts of an articulated or deformable object may not move at the same speed, whereas shadows and reflections of an object always move with it but are not part of it. Our insight is to bootstrap objectness by first learning image features from relaxed common fate and then refining them based on visual appearance grouping, both within the image itself and statistically across images. Specifically, we first learn an image segmenter in the loop of approximating optical flow with constant segment flow plus small within-segment residual flow, and then refine it for more coherent appearance and statistical figure-ground relevance. On unsupervised video object segmentation, using only ResNet and convolutional heads, our model surpasses the state of the art by absolute gains of 7%, 9%, and 5% on DAVIS16, STv2, and FBMS59, respectively, demonstrating the effectiveness of our ideas. Our code is publicly available.
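To make the relaxed common fate idea concrete, the following is a minimal NumPy sketch of approximating an optical flow field by a constant flow per segment plus a small bounded residual. It is an illustration under stated assumptions, not the released implementation: the function name `segment_flow_reconstruction`, the `max_residual` bound, and the L1 reconstruction error are hypothetical choices, and the residual here is simply the clipped difference to the true flow, whereas a learned model would have to predict it.

```python
import numpy as np

def segment_flow_reconstruction(flow, masks, max_residual=0.1, eps=1e-8):
    """Approximate `flow` by per-segment constant flow plus a bounded residual.

    flow:  (H, W, 2) optical flow field.
    masks: (K, H, W) soft segment masks, assumed to sum to 1 at each pixel.
    Returns per-segment mean flow (K, 2), the reconstruction (H, W, 2),
    and the mean absolute reconstruction error.
    """
    # Strict common fate: one mask-weighted mean flow vector per segment.
    weights = masks.sum(axis=(1, 2)) + eps                     # (K,)
    mean_flow = np.einsum("khw,hwc->kc", masks, flow) / weights[:, None]

    # Piecewise-constant flow: each pixel blends the mean flows of its segments.
    constant = np.einsum("khw,kc->hwc", masks, mean_flow)

    # Relaxation (assumption): allow a small per-pixel residual, clipped to a bound,
    # so that slightly non-rigid motion inside a segment is tolerated.
    residual = np.clip(flow - constant, -max_residual, max_residual)
    recon = constant + residual

    loss = np.abs(flow - recon).mean()
    return mean_flow, recon, loss

# Toy example: the left half moves right, the right half moves down.
H, W = 4, 6
flow = np.zeros((H, W, 2))
flow[:, : W // 2, 0] = 1.0
flow[:, W // 2 :, 1] = 2.0

good_masks = np.zeros((2, H, W))
good_masks[0, :, : W // 2] = 1.0
good_masks[1, :, W // 2 :] = 1.0

# A poor segmentation: everything assigned to one segment.
bad_masks = np.zeros((2, H, W))
bad_masks[0] = 1.0

_, _, good_loss = segment_flow_reconstruction(flow, good_masks)
_, _, bad_loss = segment_flow_reconstruction(flow, bad_masks)
print(good_loss < bad_loss)
```

In this toy setting, masks that follow the motion boundary reconstruct the flow almost exactly, while a single merged segment leaves a large error that the bounded residual cannot absorb. That gap is the kind of signal a segmenter could be trained on.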