Current state-of-the-art Video Object Segmentation (VOS) methods rely on dense per-object mask annotations both during training and testing. This requires time-consuming and costly video annotation mechanisms. We propose a novel Point-VOS task with a spatio-temporally sparse point-wise annotation scheme that substantially reduces the annotation effort. We apply our annotation scheme to two large-scale video datasets with text descriptions and annotate over 19M points across 133K objects in 32K videos. Based on our annotations, we propose a new Point-VOS benchmark, and a corresponding point-based training mechanism, which we use to establish strong baseline results. We show that existing VOS methods can easily be adapted to leverage our point annotations during training, and can achieve results close to the fully-supervised performance when trained on pseudo-masks generated from these points. In addition, we show that our data can be used to improve models that connect vision and language, by evaluating it on the Video Narrative Grounding (VNG) task. We will make our code and annotations available at https://pointvos.github.io.