Video Instance Segmentation (VIS) aims to segment and classify objects in videos from a closed set of training categories, and therefore lacks the ability to generalize to novel categories in real-world videos. To address this limitation, we make three contributions. First, we introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories, including novel categories unseen during training. Second, to benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS), which contains well-annotated objects from 1,196 diverse categories, exceeding the category count of existing datasets by more than an order of magnitude. Third, we propose an efficient Memory-Induced Transformer architecture, OV2Seg, the first to achieve Open-Vocabulary VIS in an end-to-end manner with near real-time inference speed. Extensive experiments on LV-VIS and four existing VIS datasets demonstrate the strong zero-shot generalization ability of OV2Seg on novel categories. The dataset and code are released at https://github.com/haochenheheda/LVVIS.