Previous work on video object segmentation (VOS) trains models on densely annotated videos. However, acquiring pixel-level annotations is expensive and time-consuming. In this work, we demonstrate the feasibility of training a satisfactory VOS model on sparsely annotated videos: we require only two labeled frames per training video, and performance is sustained. We term this novel training paradigm two-shot video object segmentation, or two-shot VOS for short. The underlying idea is to generate pseudo labels for unlabeled frames during training and to optimize the model on the combination of labeled and pseudo-labeled data. Our approach is extremely simple and can be applied to a majority of existing frameworks. We first pre-train a VOS model on sparsely annotated videos in a semi-supervised manner, with the first frame always being a labeled one. Then, we use the pre-trained VOS model to generate pseudo labels for all unlabeled frames, which are subsequently stored in a pseudo-label bank. Finally, we retrain a VOS model on both labeled and pseudo-labeled data without any restrictions on the first frame. For the first time, we present a general way to train VOS models on two-shot VOS datasets. Using 7.3% and 2.9% of the labeled data of the YouTube-VOS and DAVIS benchmarks, respectively, our approach achieves results comparable to those of counterparts trained on the fully labeled sets. Code and models are available at https://github.com/yk-pku/Two-shot-Video-Object-Segmentation.
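The three phases described in the abstract (pre-training with a labeled first frame, filling a pseudo-label bank, and retraining without the first-frame restriction) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Video` class, the helper names, and the `train` and `predict` callables are all hypothetical stand-ins, masks are represented as plain strings, and how ground-truth and pseudo labels are combined is an assumption. The released repository should be consulted for the actual training code.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Mask = str   # stand-in for a per-pixel mask array (assumption)
Model = Any  # stand-in for a trained VOS network (assumption)
Bank = Dict[Tuple[str, int], Mask]  # (video name, frame index) -> pseudo label


@dataclass
class Video:
    """Hypothetical container; `labels` holds two entries in the two-shot setting."""
    name: str
    num_frames: int
    labels: Dict[int, Mask]


def sample_clip(video: Video, clip_len: int, labeled_first: bool,
                rng: random.Random) -> List[int]:
    """Pick frame indices for one training clip.

    With labeled_first=True (phase 1) the clip starts at a ground-truth frame;
    with labeled_first=False (phase 3) any frame may come first.
    """
    if labeled_first:
        first = rng.choice(sorted(video.labels))
        rest = [i for i in range(video.num_frames) if i != first]
        return [first] + rng.sample(rest, clip_len - 1)
    return rng.sample(range(video.num_frames), clip_len)


def build_pseudo_label_bank(videos: List[Video],
                            predict: Callable[[Video, int], Mask]) -> Bank:
    """Phase 2: predict a mask for every unlabeled frame and store it."""
    bank: Bank = {}
    for video in videos:
        for i in range(video.num_frames):
            if i not in video.labels:
                bank[(video.name, i)] = predict(video, i)
    return bank


def merged_labels(video: Video, bank: Bank) -> Dict[int, Mask]:
    """Combine pseudo labels with ground truth; ground truth takes precedence."""
    merged = {i: m for (name, i), m in bank.items() if name == video.name}
    merged.update(video.labels)
    return merged


def two_shot_pipeline(videos: List[Video],
                      train: Callable[[List[Video], bool, Bank], Model],
                      predict: Callable[[Model, Video, int], Mask]):
    # Phase 1: pre-train on sparse labels, first frame always labeled.
    pretrained = train(videos, True, {})
    # Phase 2: fill the pseudo-label bank with the pre-trained model.
    bank = build_pseudo_label_bank(videos, lambda v, i: predict(pretrained, v, i))
    # Phase 3: retrain on labeled + pseudo-labeled frames, no first-frame restriction.
    final = train(videos, False, bank)
    return final, bank


# Toy stand-ins so the sketch runs end to end; a real setup would plug in a VOS model.
def toy_train(videos: List[Video], labeled_first: bool, bank: Bank) -> Model:
    return {"labeled_first": labeled_first, "n_pseudo": len(bank)}


def toy_predict(model: Model, video: Video, frame: int) -> Mask:
    return "pseudo"


videos = [Video("a", 6, {0: "gt", 3: "gt"}), Video("b", 5, {1: "gt", 4: "gt"})]
final_model, bank = two_shot_pipeline(videos, toy_train, toy_predict)
```

In this sketch the only difference between the first and third phases is the `labeled_first` flag and the contents of the bank, which mirrors the abstract's claim that the scheme can wrap a majority of existing VOS frameworks without changing the model itself.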