Although video perception models have made remarkable advances in recent years, they still rely heavily on explicit text descriptions or pre-defined categories to identify target instances before performing video perception tasks. Such models cannot proactively comprehend and reason about users' intentions from textual input. Although previous works have explored combining reasoning with image segmentation, they fail to extend this reasoning to videos because of the complexity of object motion in videos. To bridge the gap between images and videos, in this work we propose a new video segmentation task: video reasoning segmentation. The task is designed to output tracklets of segmentation masks given a complex input text query. Moreover, to promote research in this unexplored area, we construct a reasoning video segmentation benchmark. Finally, we present ViLLa: Video reasoning segmentation with a Large Language Model, which couples the language generation capabilities of multimodal Large Language Models (LLMs) with the abilities to detect, segment, and track multiple instances. We use a temporal-aware context aggregation module to incorporate contextual visual cues into text embeddings and propose a video-frame decoder to build temporal correlations across segmentation tokens. Remarkably, ViLLa handles both complex reasoning and referring video segmentation, and it also performs strongly on various temporal understanding benchmarks. Both quantitative and qualitative experiments show that our method effectively unlocks new video reasoning segmentation capabilities for multimodal LLMs. The code and dataset will be available at https://github.com/rkzheng99/ViLLa.
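The abstract's mention of a temporal-aware context aggregation module that fuses contextual visual cues into text embeddings can be illustrated with a minimal sketch. This is NOT the paper's implementation; it assumes a simple cross-attention scheme in which text tokens attend over per-frame visual features augmented with a sinusoidal temporal encoding. All shapes, names, and the residual fusion are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_context_aggregation(text_emb, frame_feats):
    """Hypothetical sketch of temporal-aware context aggregation.

    text_emb:    (L, D)    text token embeddings
    frame_feats: (T, N, D) visual features, T frames of N patches each

    Returns text embeddings enriched with temporally-tagged visual context.
    """
    T, N, D = frame_feats.shape
    visual = frame_feats.reshape(T * N, D)            # flatten all frames into one key/value set
    # sinusoidal temporal encoding so keys carry frame identity
    t_idx = np.repeat(np.arange(T), N)[:, None]       # (T*N, 1) frame index per patch
    dims = np.arange(D)[None, :]
    temporal_pe = np.sin(t_idx / (10000 ** (dims / D)))
    keys = visual + temporal_pe
    # cross-attention: text queries attend over temporally-encoded visual keys
    attn = softmax(text_emb @ keys.T / np.sqrt(D))    # (L, T*N)
    context = attn @ visual                           # (L, D) aggregated visual cues
    return text_emb + context                         # residual fusion into text embeddings
```

In this sketch the temporal encoding is added only to the attention keys, so attention weights are frame-aware while the aggregated values remain the raw visual features; a learned projection or gating would be a natural alternative design.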