We propose 'open-world video instance segmentation and captioning', a new task that requires detecting, segmenting, tracking, and describing never-before-seen objects with rich captions. We address this challenging task by developing "abstractors" that connect a vision model and a language foundation model. Concretely, we connect a multi-scale visual feature extractor and a large language model (LLM) via an object abstractor and an object-to-text abstractor. The object abstractor, consisting of a prompt encoder and transformer blocks, introduces spatially diverse open-world object queries to discover never-before-seen objects in videos, and an inter-query contrastive loss further encourages the diversity of the object queries. The object-to-text abstractor, augmented with masked cross-attention, acts as a bridge between the object queries and a frozen LLM, generating a rich, descriptive object-centric caption for each detected object. Our generalized approach surpasses the baseline that jointly addresses open-world video instance segmentation and dense video object captioning by 13% on never-before-seen objects and by 10% on object-centric captions.
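The abstract leaves the inter-query contrastive loss unspecified. As a minimal sketch of one plausible instantiation — the function name, the temperature value, and the log-mean-exp penalty over off-diagonal cosine similarities are our assumptions, not the paper's exact formulation — it might look as follows in PyTorch:

```python
import torch
import torch.nn.functional as F

def inter_query_contrastive_loss(queries: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
    """Penalize similarity between distinct object queries.

    queries: (N, D) tensor of N object query embeddings from the object
    abstractor. Minimizing the returned scalar pushes queries apart so
    they specialize to different regions/objects.
    """
    q = F.normalize(queries, dim=-1)                 # unit-length queries
    sim = (q @ q.t()) / temperature                  # (N, N) scaled cosine similarity
    off_diag = ~torch.eye(q.size(0), dtype=torch.bool, device=q.device)
    return sim[off_diag].exp().mean().log()          # smooth penalty on pairwise similarity

# Illustrative usage: 100 queries of dimension 256.
queries = torch.randn(100, 256)
loss = inter_query_contrastive_loss(queries)         # scalar, differentiable
```

Minimizing this quantity drives the pairwise similarities between distinct queries down, which is one concrete way to encourage the spatial diversity of the queries.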
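The masked cross-attention inside the object-to-text abstractor is likewise only named, not defined, in the abstract. Below is a minimal single-head sketch, assuming each object's text tokens may attend only to visual features inside that object's predicted mask; all tensor names and shapes are illustrative:

```python
import torch

def masked_cross_attention(text_queries: torch.Tensor,
                           visual_feats: torch.Tensor,
                           object_mask: torch.Tensor) -> torch.Tensor:
    """Single-head masked cross-attention for one detected object.

    text_queries: (T, D) tokens that will be passed on to the frozen LLM.
    visual_feats: (HW, D) flattened visual features for a frame.
    object_mask:  (HW,) boolean mask from the object's predicted segmentation;
                  assumed non-empty, otherwise the softmax is undefined.
    Returns (T, D) object-centric tokens pooled from inside the mask only.
    """
    scale = visual_feats.size(-1) ** 0.5
    scores = (text_queries @ visual_feats.t()) / scale        # (T, HW) attention logits
    scores = scores.masked_fill(~object_mask, float('-inf'))  # hide features outside the object
    weights = scores.softmax(dim=-1)
    return weights @ visual_feats                             # (T, D)
```

Restricting attention to the predicted mask keeps the tokens handed to the frozen LLM object-centric rather than scene-level, matching the abstract's goal of per-object captions.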