Video Object Segmentation (VOS) is crucial for several applications, from video editing to video data generation. Training a VOS model requires an abundance of manually labeled training videos. The de facto way of annotating objects requires humans to draw detailed segmentation masks on the target objects in every video frame. This annotation process, however, is tedious and time-consuming. To reduce this annotation cost, we propose EVA-VOS, a human-in-the-loop annotation framework for video object segmentation. Unlike the traditional approach, we introduce an agent that iteratively predicts both which frame ("What") to annotate and which annotation type ("How") to use. The annotator then annotates only the selected frame, and that annotation is used to update a VOS module, leading to significant gains in annotation time. Experiments on the MOSE and DAVIS datasets show that: (a) EVA-VOS produces masks with accuracy close to human agreement 3.5x faster than the standard way of annotating videos; (b) our frame selection achieves state-of-the-art performance; (c) EVA-VOS yields significant gains in annotation time compared to all other methods and baselines.
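The iterative "What"/"How" loop described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the EVA-VOS implementation: the names `AnnotationType`, `select_frame`, `select_annotation_type`, and `annotate_video` are hypothetical, the per-frame quality scores and annotation costs are made-up placeholders, the selection rules are simple heuristics standing in for the agent, and the linear `decay` propagation is a crude stand-in for updating a VOS module.

```python
from dataclasses import dataclass


@dataclass
class AnnotationType:
    name: str             # hypothetical label, not taken from the paper
    cost_seconds: float   # made-up time the annotator spends
    quality_after: float  # made-up mask quality of a frame once annotated


def select_frame(quality):
    """'What': pick the frame whose current mask quality is lowest (heuristic)."""
    return min(range(len(quality)), key=lambda i: quality[i])


def select_annotation_type(current_quality, types):
    """'How': pick the type with the largest quality gain per second (heuristic)."""
    return max(types, key=lambda t: (t.quality_after - current_quality) / t.cost_seconds)


def annotate_video(quality, types, target, decay=0.1, max_rounds=100):
    """Annotate one frame per round until mean quality reaches the target."""
    quality = list(quality)
    total_cost = 0.0
    history = []
    for _ in range(max_rounds):
        if sum(quality) / len(quality) >= target:
            break
        frame = select_frame(quality)
        ann = select_annotation_type(quality[frame], types)
        if ann.quality_after <= quality[frame]:
            break  # no annotation type can improve the worst frame
        # Toy stand-in for "annotate the frame, then update the VOS module":
        # the annotated frame reaches quality_after, and the benefit fades
        # linearly with distance to the other frames.
        for i in range(len(quality)):
            propagated = ann.quality_after - decay * abs(i - frame)
            quality[i] = max(quality[i], propagated)
        total_cost += ann.cost_seconds
        history.append((frame, ann.name))
    return quality, total_cost, history


if __name__ == "__main__":
    types = [AnnotationType("coarse", 5.0, 0.7), AnnotationType("detailed", 30.0, 0.95)]
    final, cost, history = annotate_video([0.2, 0.3, 0.1, 0.4], types, target=0.55)
    print(history, cost, final)
```

In this toy setup the loop stops once the mean quality reaches the target, so only a subset of frames is ever annotated, which mirrors the source of the time savings the abstract describes.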