Current state-of-the-art Video Object Segmentation (VOS) methods rely on dense per-object mask annotations both during training and testing. This requires time-consuming and costly video annotation mechanisms. We propose a novel Point-VOS task with a spatio-temporally sparse point-wise annotation scheme that substantially reduces the annotation effort. We apply our annotation scheme to two large-scale video datasets with text descriptions and annotate over 19M points across 133K objects in 32K videos. Based on our annotations, we propose a new Point-VOS benchmark and a corresponding point-based training mechanism, which we use to establish strong baseline results. We show that existing VOS methods can easily be adapted to leverage our point annotations during training, and that they achieve results close to fully-supervised performance when trained on pseudo-masks generated from these points. In addition, we show that our data can be used to improve models that connect vision and language, as demonstrated by evaluation on the Video Narrative Grounding (VNG) task. We will make our code and annotations available at https://pointvos.github.io.