We consider the problem of segmenting objects in videos using motion as the only form of supervision. Prior work has often approached this problem via the principle of common fate, namely the observation that the motion of points belonging to the same object is strongly correlated. However, most authors have only considered instantaneous motion from optical flow. In this work, we present a way to train a segmentation network using long-term point trajectories as a supervisory signal to complement optical flow. The key difficulty is that long-term motion, unlike instantaneous motion, is hard to model -- any parametric approximation is unlikely to capture complex motion patterns over long periods of time. We instead draw inspiration from subspace clustering approaches, proposing a loss function that seeks to group the trajectories into low-rank matrices, in which the motion of object points can be approximately explained as a linear combination of other point tracks. Our method outperforms the prior art on motion-based segmentation, demonstrating the utility of long-term motion and the effectiveness of our formulation.
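The low-rank intuition behind the loss can be illustrated with a minimal NumPy sketch (not the paper's implementation; all names and the affine-camera assumption are ours for illustration). Under an affine camera, the 2F x P matrix stacking P point tracks over F frames of a single rigid object has rank at most 4, so any track is approximately a linear combination of the other tracks of the same object, whereas mixing tracks from independently moving objects raises the rank:

```python
import numpy as np

rng = np.random.default_rng(0)
F, P = 30, 40  # frames, points per object (illustrative sizes)

def rigid_tracks(F, P, rng):
    # Affine-camera model: every trajectory is a linear combination of
    # 4 basis trajectories (camera rows over time), so the stacked
    # 2F x P trajectory matrix has rank <= 4.
    basis = rng.normal(size=(2 * F, 4))   # per-frame camera rows
    shape = rng.normal(size=(4, P))       # homogeneous 3D points
    return basis @ shape

W1 = rigid_tracks(F, P, rng)  # tracks of object 1
W2 = rigid_tracks(F, P, rng)  # tracks of object 2 (independent motion)

print(np.linalg.matrix_rank(W1))                   # 4: one object is low-rank
print(np.linalg.matrix_rank(np.hstack([W1, W2])))  # 8: mixing objects raises rank

# Self-expressiveness: a track of object 1 is (approximately) a linear
# combination of the remaining tracks of the same object.
c, *_ = np.linalg.lstsq(W1[:, 1:], W1[:, 0], rcond=None)
print(np.linalg.norm(W1[:, 1:] @ c - W1[:, 0]))    # ~0 (numerical noise)
```

A grouping that assigns tracks to objects so that each group's matrix is low-rank therefore recovers the segmentation, which is the structure the proposed loss rewards; real trajectories are only approximately low-rank, hence the "approximately explained" phrasing in the abstract.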