Unsupervised video object segmentation has made significant progress in recent years, but the manual annotation of video mask datasets is expensive and limits the diversity of available datasets. The Segment Anything Model (SAM) has introduced a new prompt-driven paradigm for image segmentation, unlocking a range of previously unexplored capabilities. In this paper, we propose a novel paradigm called UVOSAM, which leverages SAM for unsupervised video object segmentation without requiring video mask labels. To address SAM's limitations in instance discovery and identity association, we introduce a video salient object tracking network that automatically generates trajectories for prominent foreground objects. These trajectories then serve as prompts for SAM to produce video masks on a frame-by-frame basis. Our experimental results demonstrate that UVOSAM significantly outperforms current mask-supervised methods. These findings suggest that UVOSAM has the potential to improve unsupervised video object segmentation and reduce the cost of manual annotation.
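The trajectory-to-prompt step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each trajectory is a per-frame bounding box, which the abstract does not specify, and it replaces SAM with a hypothetical stand-in, `box_fill_segmenter`, that simply fills the box. All names here (`segment_video`, `box_fill_segmenter`, the type aliases) are illustrative. The sketch only shows how object identity can come from the tracker while the segmenter is called independently on each frame.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1); max corner exclusive (assumed prompt format)
Frame = List[List[int]]           # placeholder H x W grayscale image
Mask = List[List[bool]]           # H x W binary mask
Trajectory = Dict[int, Box]       # frame index -> box for one tracked object


def box_fill_segmenter(frame: Frame, box: Box) -> Mask:
    """Hypothetical stand-in for a box-prompted segmenter such as SAM.

    It marks every pixel inside the box; a real model would return a
    tighter, object-shaped mask.
    """
    height, width = len(frame), len(frame[0])
    x0, y0, x1, y1 = box
    return [
        [x0 <= x < x1 and y0 <= y < y1 for x in range(width)]
        for y in range(height)
    ]


def segment_video(
    frames: List[Frame],
    trajectories: Dict[int, Trajectory],
    segmenter: Callable[[Frame, Box], Mask],
) -> Dict[int, Dict[int, Mask]]:
    """Return masks[track_id][frame_index].

    Identity association comes from the trajectory; the segmenter sees
    one frame and one prompt at a time.
    """
    masks: Dict[int, Dict[int, Mask]] = {}
    for track_id, trajectory in trajectories.items():
        masks[track_id] = {}
        for frame_index, box in sorted(trajectory.items()):
            masks[track_id][frame_index] = segmenter(frames[frame_index], box)
    return masks


# Two blank 4 x 6 frames and one object that moves one pixel to the right.
frames = [[[0] * 6 for _ in range(4)] for _ in range(2)]
trajectories = {0: {0: (1, 1, 3, 3), 1: (2, 1, 4, 3)}}
video_masks = segment_video(frames, trajectories, box_fill_segmenter)
```

Because the segmenter is passed in as a callable, the stand-in could be swapped for a wrapper around a real prompt-driven model without changing the per-frame loop.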