This paper addresses self-supervised learning of video object segmentation (VOS). We develop a unified framework that simultaneously models cross-frame dense correspondence for locally discriminative feature learning and embeds object-level context for target-mask decoding. As a result, the framework learns mask-guided sequential segmentation directly from unlabeled videos, in contrast to previous efforts, which usually rely on an oblique solution: cheaply "copying" labels according to pixel-wise correlations. Concretely, our algorithm alternates between i) clustering video pixels to create pseudo segmentation labels ex nihilo; and ii) using the pseudo labels to learn mask encoding and decoding for VOS. Unsupervised correspondence learning is further incorporated into this self-taught mask embedding scheme to keep the learnt representation generic and to avoid cluster degeneracy. Our algorithm sets a new state of the art on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS), narrowing the gap between self- and fully-supervised VOS in terms of both performance and network architecture design.
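To make two ingredients of the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. `propagate_labels` illustrates the correspondence-based "copying" of labels that earlier methods rely on, and `cluster_pixels` illustrates how pseudo segmentation labels might be created by clustering pixel features. The function names, the temperature value, and the use of plain k-means are all assumptions made for illustration; the abstract does not specify the clustering method or the network that produces the features.

```python
import numpy as np


def propagate_labels(ref_feats, ref_labels, tgt_feats, temperature=0.07):
    """Copy reference labels to target pixels via softmax feature affinity.

    ref_feats:  (N_ref, D) per-pixel features of the reference frame
    ref_labels: (N_ref, K) one-hot (or soft) mask labels of the reference frame
    tgt_feats:  (N_tgt, D) per-pixel features of the target frame
    Returns (N_tgt, K) soft labels for the target frame.
    The temperature is an illustrative assumption, not a value from the paper.
    """
    # L2-normalise so the dot product is a cosine similarity.
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    logits = tgt @ ref.T / temperature            # (N_tgt, N_ref)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    affinity = np.exp(logits)
    affinity /= affinity.sum(axis=1, keepdims=True)
    # Each target pixel receives an affinity-weighted average of reference labels.
    return affinity @ ref_labels


def cluster_pixels(feats, k, iters=10, seed=0):
    """Plain k-means over pixel features (a stand-in clustering step).

    feats: (N, D) float array of per-pixel features
    Returns integer pseudo labels of shape (N,).
    """
    rng = np.random.default_rng(seed)
    # Initialise centres from k distinct pixels (fancy indexing returns a copy).
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every pixel to every centre.
        dists = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = feats[labels == c]
            if len(members):  # keep the old centre if a cluster empties
                centers[c] = members.mean(axis=0)
    return labels
```

In the scheme the abstract describes, pseudo labels of the second kind would supervise a mask encoder and decoder, so that segmentation is learned directly rather than obtained by copying labels at test time as in the first function.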