Labelling pixel-wise object masks in a video is expensive and labour-intensive. As a result, the number of pixel-wise annotations in existing video instance segmentation (VIS) datasets is small, which limits the generalization capability of trained VIS models. A much cheaper alternative is to label instances in videos with bounding boxes. Inspired by the recent success of box-supervised image instance segmentation, we adapt state-of-the-art pixel-supervised VIS models to a box-supervised VIS (BoxVIS) baseline and observe a slight performance degradation. We therefore propose to improve BoxVIS performance in two ways. First, we propose a box-center guided spatial-temporal pairwise affinity (STPA) loss so that the predicted instance masks have better spatial and temporal consistency. Second, we collect a larger-scale box-annotated VIS dataset (BVISD) by consolidating the videos from current VIS benchmarks and converting images from the COCO dataset into short pseudo video clips. With the proposed BVISD and the STPA loss, our trained BoxVIS model achieves 43.2\% and 29.0\% mask AP on the YouTube-VIS 2021 and OVIS validation sets, respectively. It exhibits comparable instance mask prediction performance and better generalization ability than state-of-the-art pixel-supervised VIS models while using only 16\% of their annotation time and cost. Code and data can be found at \url{https://github.com/MinghanLi/BoxVIS}.
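The abstract does not say how COCO images are converted into short pseudo video clips. As an illustration only, the sketch below assumes each frame is a randomly shifted crop of the same image, with the bounding boxes moved into crop coordinates. The function `make_pseudo_clip` and its parameters (`num_frames`, `max_shift`, `seed`) are hypothetical and are not taken from the BoxVIS code.

```python
import random


def make_pseudo_clip(image, boxes, num_frames=3, max_shift=4, seed=0):
    """Turn one box-annotated image into a short pseudo video clip.

    Assumption (not from the paper): frames are randomly shifted crops.
    image: list of rows (H x W); boxes: list of (x1, y1, x2, y2) in pixels.
    An instance keeps its list index (its identity) across all frames.
    """
    height, width = len(image), len(image[0])
    crop_h, crop_w = height - max_shift, width - max_shift
    if crop_h <= 0 or crop_w <= 0:
        raise ValueError("max_shift must be smaller than the image size")

    rng = random.Random(seed)
    clip = []
    for _ in range(num_frames):
        dx = rng.randint(0, max_shift)  # horizontal crop offset
        dy = rng.randint(0, max_shift)  # vertical crop offset
        frame = [row[dx:dx + crop_w] for row in image[dy:dy + crop_h]]

        frame_boxes = []
        for x1, y1, x2, y2 in boxes:
            # Move the box into crop coordinates and clip it to the crop.
            nx1, ny1 = max(x1 - dx, 0), max(y1 - dy, 0)
            nx2, ny2 = min(x2 - dx, crop_w), min(y2 - dy, crop_h)
            # None marks an instance that left the crop in this frame.
            visible = nx2 > nx1 and ny2 > ny1
            frame_boxes.append((nx1, ny1, nx2, ny2) if visible else None)
        clip.append((frame, frame_boxes))
    return clip
```

Each element of the returned list is a `(frame, frame_boxes)` pair, so a box-supervised model could treat the list as a clip in which the same instance appears under small apparent motion. A real pipeline would likely add scaling, rotation, or flipping; those are left out here to keep the sketch short.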