Online Unsupervised Video Object Segmentation via Contrastive Motion Clustering

Online unsupervised video object segmentation (UVOS) uses the previous frames as its input to automatically separate the primary object(s) from a streaming video without using any further manual annotation. A major challenge is that the model has no access to the future and must rely solely on the history, i.e., the segmentation mask is predicted from the current frame as soon as it is captured. In this work, a novel contrastive motion clustering algorithm that takes optical flow as its input is proposed for online UVOS. It exploits the common fate principle: visual elements tend to be perceived as a group if they possess the same motion pattern. We build a simple and effective auto-encoder to iteratively summarize non-learnable prototypical bases for the motion pattern, while the bases in turn help learn the representation of the embedding network. Further, a contrastive learning strategy based on a boundary prior is developed to improve foreground and background feature discrimination in the representation learning stage. The proposed algorithm can be optimized on data of arbitrary scale (i.e., frame, clip, dataset) and performed in an online fashion. Experiments on the $\textit{DAVIS}_{\textit{16}}$, $\textit{FBMS}$, and $\textit{SegTrackV2}$ datasets show that the accuracy of our method surpasses the previous state-of-the-art (SoTA) online UVOS method by margins of 0.8%, 2.9%, and 1.1%, respectively. Furthermore, by using online deep subspace clustering to tackle motion grouping, our method achieves higher accuracy with $3\times$ faster inference than the SoTA online UVOS method, striking a good trade-off between effectiveness and efficiency.
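The idea of iteratively summarizing prototypical bases for the motion pattern can be illustrated with a small NumPy sketch. This is not the auto-encoder or the contrastive loss proposed here: it is plain soft k-means on raw optical-flow vectors, and it uses the boundary prior only to initialize the background base. The function names `boundary_mask` and `cluster_motion`, the `margin` and `temperature` parameters, and the choice of exactly two bases are all assumptions made for illustration.

```python
import numpy as np


def boundary_mask(height, width, margin=1):
    """Boolean mask that is True on the frame border (assumed to be background)."""
    mask = np.zeros((height, width), dtype=bool)
    mask[:margin, :] = True
    mask[-margin:, :] = True
    mask[:, :margin] = True
    mask[:, -margin:] = True
    return mask


def cluster_motion(flow, num_iters=10, temperature=0.5, margin=1):
    """Group flow vectors into two motion bases (index 0: background, 1: foreground).

    flow: array of shape (H, W, 2) holding per-pixel optical flow.
    Returns a boolean foreground mask of shape (H, W) and the bases of shape (2, 2).
    Illustrative stand-in only: soft k-means on raw flow, not a learned embedding.
    """
    height, width, _ = flow.shape
    feats = flow.reshape(-1, 2).astype(np.float64)

    # Boundary prior (assumption): border pixels mostly belong to the background,
    # so their mean flow initializes the background base.
    border = boundary_mask(height, width, margin).reshape(-1)
    background = feats[border].mean(axis=0)

    # The foreground base starts at the flow vector farthest from the background base.
    farthest = np.argmax(((feats - background) ** 2).sum(axis=1))
    bases = np.stack([background, feats[farthest]])

    for _ in range(num_iters):
        # Soft assignment of every pixel to the bases (softmax over negative distances).
        dist = ((feats[:, None, :] - bases[None, :, :]) ** 2).sum(axis=-1)
        logits = -dist / temperature
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        assign = np.exp(logits)
        assign /= assign.sum(axis=1, keepdims=True)

        # Bases are not trained by gradient descent: each is re-estimated
        # as the assignment-weighted mean of the features.
        bases = (assign.T @ feats) / (assign.sum(axis=0)[:, None] + 1e-8)

    mask = assign.argmax(axis=1).reshape(height, width) == 1
    return mask, bases
```

Calling `cluster_motion(flow)` on the flow of the current frame returns a mask without looking at any future frame, which is the constraint the online setting imposes. A learned embedding would replace the raw flow vectors in `feats`, and a contrastive objective would use the border pixels as background samples rather than only as an initialization.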
