Although video perception models have made remarkable advances in recent years, they still rely heavily on explicit text descriptions or pre-defined categories to identify target instances before performing video perception tasks. Such models cannot proactively comprehend and reason about users' intentions from textual input. Although previous works have explored combining reasoning with image segmentation, they fail to extend this reasoning to videos because of the complexity of object motion in videos. To bridge the gap between images and videos, in this work we propose a new video segmentation task: video reasoning segmentation. The task is designed to output tracklets of segmentation masks given a complex input text query. Moreover, to promote research in this unexplored area, we construct a reasoning video segmentation benchmark. Finally, we present ViLLa: Video reasoning segmentation with a Large Language Model, which couples the language generation capabilities of multimodal Large Language Models (LLMs) with the abilities to detect, segment, and track multiple instances. We use a temporal-aware context aggregation module to incorporate contextual visual cues into text embeddings and propose a video-frame decoder to build temporal correlations across segmentation tokens. Remarkably, ViLLa handles both complex reasoning and referring video segmentation, and it also performs strongly on various temporal understanding benchmarks. Both quantitative and qualitative experiments show that our method effectively unlocks new video reasoning segmentation capabilities for multimodal LLMs. The code and dataset will be available at https://github.com/rkzheng99/ViLLa.
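The abstract's mention of a temporal-aware context aggregation module that fuses contextual visual cues into text embeddings can be illustrated with a minimal sketch. This is NOT the paper's implementation; it assumes a simple cross-attention scheme in which text tokens attend over per-frame visual features augmented with a sinusoidal temporal encoding. All shapes, names, and the residual fusion are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_context_aggregation(text_emb, frame_feats):
    """Hypothetical sketch of temporal-aware context aggregation.

    text_emb:    (L, D)    text token embeddings
    frame_feats: (T, N, D) visual features, T frames of N patches each

    Returns text embeddings enriched with temporally-tagged visual context.
    """
    T, N, D = frame_feats.shape
    visual = frame_feats.reshape(T * N, D)            # flatten all frames into one key/value set
    # sinusoidal temporal encoding so keys carry frame identity
    t_idx = np.repeat(np.arange(T), N)[:, None]       # (T*N, 1) frame index per patch
    dims = np.arange(D)[None, :]
    temporal_pe = np.sin(t_idx / (10000 ** (dims / D)))
    keys = visual + temporal_pe
    # cross-attention: text queries attend over temporally-encoded visual keys
    attn = softmax(text_emb @ keys.T / np.sqrt(D))    # (L, T*N)
    context = attn @ visual                           # (L, D) aggregated visual cues
    return text_emb + context                         # residual fusion into text embeddings
```

In this sketch the temporal encoding is added only to the attention keys, so attention weights are frame-aware while the aggregated values remain the raw visual features; a learned projection or gating would be a natural alternative design.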