Online Unsupervised Video Object Segmentation via Contrastive Motion Clustering

Online unsupervised video object segmentation (UVOS) takes the previous frames as input and automatically separates the primary object(s) from a streaming video without any further manual annotation. A major challenge is that the model has no access to future frames and must rely solely on the history, i.e., the segmentation mask is predicted for the current frame as soon as it is captured. In this work, we propose a novel contrastive motion clustering algorithm for online UVOS that takes optical flow as input and exploits the common fate principle: visual elements tend to be perceived as a group if they share the same motion pattern. We build a simple and effective auto-encoder that iteratively summarizes non-learnable prototypical bases for the motion pattern, while the bases in turn help learn the representation of the embedding network. Further, a contrastive learning strategy based on a boundary prior is developed to improve the discrimination between foreground and background features in the representation learning stage. The proposed algorithm can be optimized on data of arbitrary scale (i.e., frame, clip, or dataset) and performed in an online fashion. Experiments on the $\textit{DAVIS}_{\textit{16}}$, $\textit{FBMS}$, and $\textit{SegTrackV2}$ datasets show that the accuracy of our method surpasses the previous state-of-the-art (SoTA) online UVOS method by margins of 0.8%, 2.9%, and 1.1%, respectively. Furthermore, by using online deep subspace clustering to tackle motion grouping, our method achieves higher accuracy with $3\times$ faster inference than the SoTA online UVOS method, striking a good trade-off between effectiveness and efficiency.
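To make the motion-grouping idea concrete, the following is a minimal sketch, not the method itself. It assumes raw two-channel optical flow vectors stand in for the learned embeddings, uses a plain soft k-means loop as the iterative summarization of non-learnable bases, and reduces the boundary prior to a majority vote over border pixels. The embedding network and the contrastive loss are omitted, and every name (`init_bases`, `summarize_bases`, `segment_flow`, `make_toy_flow`), the temperature, and the farthest-point initialisation are hypothetical choices made for illustration.

```python
import numpy as np


def init_bases(feats, num_bases):
    """Deterministic farthest-point initialisation (an assumption of this sketch)."""
    bases = [feats[0]]
    for _ in range(num_bases - 1):
        # Squared distance of every feature to its nearest already-chosen base.
        d2 = np.min([((feats - b) ** 2).sum(axis=1) for b in bases], axis=0)
        bases.append(feats[np.argmax(d2)])
    return np.stack(bases)


def summarize_bases(feats, num_bases=2, num_iters=10, temperature=1.0):
    """Soft k-means: bases are re-estimated iteratively, not learned by gradients.

    feats: (N, D) array. Returns bases (K, D) and the soft assignments (N, K)
    computed in the last assignment step.
    """
    bases = init_bases(feats, num_bases)
    for _ in range(num_iters):
        # Assignment step: softmax over negative squared distances.
        d2 = ((feats[:, None, :] - bases[None, :, :]) ** 2).sum(axis=2)
        logits = -d2 / temperature
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
        # Update step: each base becomes the weighted mean of its members.
        bases = (resp.T @ feats) / (resp.sum(axis=0)[:, None] + 1e-8)
    return bases, resp


def segment_flow(flow, num_bases=2):
    """Group pixels by motion; the cluster dominating the image border is
    treated as background (a simplified stand-in for a boundary prior)."""
    height, width, channels = flow.shape
    feats = flow.reshape(-1, channels)
    _, resp = summarize_bases(feats, num_bases=num_bases)
    labels = resp.argmax(axis=1).reshape(height, width)
    border = np.concatenate(
        [labels[0, :], labels[-1, :], labels[:, 0], labels[:, -1]]
    )
    background = np.bincount(border, minlength=num_bases).argmax()
    return labels != background


def make_toy_flow(height=32, width=32, seed=0):
    """Synthetic flow: near-static background plus one rectangle moving right."""
    rng = np.random.default_rng(seed)
    flow = rng.normal(0.0, 0.05, size=(height, width, 2))
    mask = np.zeros((height, width), dtype=bool)
    mask[10:20, 12:24] = True
    flow[mask] += np.array([3.0, 0.0])
    return flow, mask
```

Calling `segment_flow(flow)` on the output of `make_toy_flow()` yields a boolean foreground mask for the moving rectangle. Because the sketch only needs the flow of the current frame, it can be applied frame by frame as a stream arrives, which is the sense in which the grouping runs online here.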
