We propose and study a new computer vision task, open-vocabulary video instance segmentation (OpenVIS), which aims to simultaneously segment, detect, and track arbitrary objects in a video according to their text descriptions. Compared with conventional video instance segmentation, OpenVIS lets users identify objects of any desired category, regardless of whether that category appeared in the training dataset. To this end, we propose a two-stage pipeline that first generates high-quality class-agnostic object masks and then predicts their categories with a pre-trained vision-language model (VLM). Specifically, we first employ a query-based mask proposal network to generate masks for all potential objects, replacing the original class head with an instance head trained with a binary object loss, which strengthens class-agnostic mask proposal. We then introduce a proposal post-processing approach that adapts the proposals to the pre-trained VLM and avoids distorted or unnatural proposal inputs. To facilitate research on this new task, we also propose an evaluation benchmark that uses off-the-shelf datasets to comprehensively assess performance on it. Experimentally, the proposed OpenVIS achieves a 148\% improvement on BURST over fully supervised baselines that were trained on all categories.
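To make the two-stage idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the binary object loss takes the form of a binary cross-entropy over per-query objectness logits, and that the VLM yields one embedding per mask proposal and one per category text, compared by cosine similarity. The function names, the `obj_threshold` and `temperature` parameters, and the toy embeddings are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def binary_object_loss(logits: Sequence[float], is_object: Sequence[int]) -> float:
    """Mean binary cross-entropy over per-query objectness logits.

    Assumed form of the binary object loss; the source does not spell it out.
    """
    total = 0.0
    for z, y in zip(logits, is_object):
        # Numerically stable BCE with logits:
        # max(z, 0) - z * y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def _normalize(v: Sequence[float]) -> List[float]:
    # Assumes a non-zero vector.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def classify_proposals(
    proposal_embs: Sequence[Sequence[float]],
    text_embs: Sequence[Sequence[float]],
    objectness: Sequence[float],
    obj_threshold: float = 0.5,   # hypothetical cut-off on objectness
    temperature: float = 0.01,    # hypothetical softmax temperature
) -> List[Tuple[int, int, float]]:
    """Assign each confident proposal to its most similar text category.

    Returns (proposal_index, category_index, probability) tuples.
    """
    text_unit = [_normalize(t) for t in text_embs]
    results = []
    for idx, (emb, score) in enumerate(zip(proposal_embs, objectness)):
        if score < obj_threshold:
            continue  # drop proposals the instance head considers background
        unit = _normalize(emb)
        # Cosine similarity to every category, scaled by the temperature.
        sims = [sum(a * b for a, b in zip(unit, t)) / temperature for t in text_unit]
        # Softmax with the maximum subtracted for numerical stability.
        peak = max(sims)
        exps = [math.exp(s - peak) for s in sims]
        denom = sum(exps)
        probs = [e / denom for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((idx, best, probs[best]))
    return results


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for VLM features.
    proposals = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    categories = [[1.0, 0.0], [0.0, 1.0]]
    print(classify_proposals(proposals, categories, objectness=[0.9, 0.8, 0.1]))
```

Because the category set enters only through `text_embs`, new categories can be added at inference time by embedding their names, which is what makes the classification step open-vocabulary in this sketch.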