Video Instance Segmentation (VIS) aims to segment and categorize objects in videos from a closed set of training categories, and therefore lacks the generalization ability to handle novel categories in real-world videos. To address this limitation, we make the following three contributions. First, we introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories, including novel categories unseen during training. Second, to benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS) that contains well-annotated objects from 1,212 diverse categories, surpassing the category size of existing datasets by more than one order of magnitude. Third, we propose an efficient Memory-Induced Vision-Language Transformer, MindVLT, which is the first to achieve Open-Vocabulary VIS in an end-to-end manner with near real-time inference speed. Extensive experiments on LV-VIS and four existing VIS datasets demonstrate the strong zero-shot generalization ability of MindVLT on novel categories. We will release the dataset and code to facilitate future research.