Online Unsupervised Video Object Segmentation via Contrastive Motion Clustering

Online unsupervised video object segmentation (UVOS) uses previous frames as input to automatically separate the primary object(s) from a streaming video without any manual annotation. A major challenge is that the model has no access to future frames and must rely solely on the history, i.e., the segmentation mask is predicted from the current frame as soon as it is captured. In this work, a novel contrastive motion clustering algorithm with optical flow as its input is proposed for online UVOS. It exploits the common fate principle: visual elements tend to be perceived as a group if they share the same motion pattern. We build a simple and effective auto-encoder to iteratively summarize non-learnable prototypical bases for the motion pattern, while the bases in turn help learn the representation of the embedding network. Further, a contrastive learning strategy based on a boundary prior is developed to improve foreground and background feature discrimination in the representation learning stage. The proposed algorithm can be optimized on data of arbitrary scale (i.e., frame, clip, or dataset) and performed in an online fashion. Experiments on the $\textit{DAVIS}_{\textit{16}}$, $\textit{FBMS}$, and $\textit{SegTrackV2}$ datasets show that the accuracy of our method surpasses the previous state-of-the-art (SoTA) online UVOS method by margins of 0.8%, 2.9%, and 1.1%, respectively. Furthermore, by using online deep subspace clustering to tackle motion grouping, our method achieves higher accuracy with $3\times$ faster inference than the SoTA online UVOS method, striking a good trade-off between effectiveness and efficiency. Our code is available at https://github.com/xilin1991/ClusterNet.
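
To make the motion-grouping idea concrete, the following is a minimal sketch and not the released ClusterNet implementation. It assumes raw per-pixel flow vectors stand in for the learned embeddings, replaces the auto-encoder with a plain soft k-means-style update of the prototypical bases, and omits the contrastive loss entirely. The function names `cluster_motion` and `foreground_mask`, the `temperature` parameter, and the initialization scheme are all hypothetical choices made for illustration. The boundary prior is approximated by treating the cluster that dominates the frame border as background.

```python
import numpy as np


def cluster_motion(flow, num_bases=2, num_iters=10, temperature=0.1):
    """Group per-pixel flow vectors around a small set of bases.

    flow: array of shape (H, W, 2). Returns (labels of shape (H, W),
    bases of shape (num_bases, 2)). Illustrative assumption: raw flow
    is used directly instead of a learned embedding.
    """
    h, w, d = flow.shape
    x = flow.reshape(-1, d).astype(np.float64)

    # Deterministic init (an assumption): pick pixels spread across
    # the range of flow magnitudes, from smallest to largest.
    order = np.argsort(np.linalg.norm(x, axis=1))
    picks = np.linspace(0, len(order) - 1, num_bases).astype(int)
    bases = x[order[picks]].copy()

    for _ in range(num_iters):
        # Soft assignment: softmax over negative squared distances.
        dist = ((x[:, None, :] - bases[None, :, :]) ** 2).sum(axis=-1)
        logits = -dist / temperature
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)

        # Base update: responsibility-weighted mean, no gradient step,
        # mirroring the "non-learnable" bases only in spirit.
        weights = np.maximum(resp.sum(axis=0), 1e-8)
        bases = (resp.T @ x) / weights[:, None]

    labels = resp.argmax(axis=1).reshape(h, w)
    return labels, bases


def foreground_mask(labels):
    """Boundary prior: the most frequent label on the border is background."""
    border = np.concatenate(
        [labels[0, :], labels[-1, :], labels[:, 0], labels[:, -1]]
    )
    background_label = np.bincount(border).argmax()
    return labels != background_label
```

On a synthetic flow field with a static background and a single rigidly moving rectangle, this sketch separates the rectangle from the background. Real optical flow is noisier, which is where a learned embedding and the contrastive objective described above would be expected to help.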
