Fine-grained alignment between videos and text is challenging due to complex spatial and temporal dynamics in videos. Existing video-based Large Multimodal Models (LMMs) handle basic conversations but struggle with precise pixel-level grounding in videos. To address this, we introduce VideoGLaMM, an LMM designed for fine-grained pixel-level grounding in videos based on user-provided textual inputs. Our design seamlessly connects three key components: a Large Language Model, a dual vision encoder that emphasizes both spatial and temporal details, and a spatio-temporal decoder for accurate mask generation. This connection is facilitated via tunable V-L and L-V adapters that enable close Vision-Language (VL) alignment. The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions. To enable fine-grained grounding, we curate a multimodal dataset featuring detailed visually-grounded conversations using a semi-automatic annotation pipeline, resulting in a diverse set of 38k video-QA triplets along with 83k objects and 671k masks. We evaluate VideoGLaMM on three challenging tasks: Grounded Conversation Generation, Visual Grounding, and Referring Video Segmentation. Experimental results show that our model consistently outperforms existing approaches across all three tasks.
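To make the described data flow concrete, below is a minimal sketch of the architecture's forward pass: dual vision encoding, a V-L adapter into the LLM's token space, the LLM, an L-V adapter back out, and a mask decoder. All module names, dimensions, and the toy linear layers standing in for the real encoders, LLM, and spatio-temporal decoder are our own illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class VideoGLaMMSketch(nn.Module):
    """Hypothetical skeleton of the dual-encoder -> V-L adapter -> LLM ->
    L-V adapter -> decoder pipeline described in the abstract. Every
    submodule here is a toy placeholder for illustration only."""

    def __init__(self, frame_dim=3 * 224 * 224, vis_dim=1024,
                 llm_dim=4096, dec_dim=256):
        super().__init__()
        # Dual vision encoder: one branch emphasizing per-frame spatial
        # detail, one emphasizing temporal dynamics (placeholders here).
        self.spatial_encoder = nn.Linear(frame_dim, vis_dim)
        self.temporal_encoder = nn.Linear(frame_dim, vis_dim)
        # Tunable V-L adapter: projects visual tokens into the LLM space.
        self.vl_adapter = nn.Linear(vis_dim, llm_dim)
        # Stand-in for the Large Language Model.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8,
                                       batch_first=True),
            num_layers=2,
        )
        # Tunable L-V adapter: maps LLM hidden states used as grounding
        # queries back into the decoder's embedding space.
        self.lv_adapter = nn.Linear(llm_dim, dec_dim)
        # Stand-in for the spatio-temporal mask decoder.
        self.mask_decoder = nn.Linear(dec_dim, 224 * 224)

    def forward(self, frames, text_embeds):
        # frames: (B, T, frame_dim) flattened video frames (toy shape)
        # text_embeds: (B, L, llm_dim) embedded textual instruction
        spatial = self.spatial_encoder(frames)             # (B, T, vis_dim)
        temporal = self.temporal_encoder(frames)           # (B, T, vis_dim)
        vis_tokens = self.vl_adapter(spatial + temporal)   # (B, T, llm_dim)
        # Interleave visual tokens with the text prompt and run the LLM.
        h = self.llm(torch.cat([vis_tokens, text_embeds], dim=1))
        # Use the final hidden state as a grounding query for the decoder.
        query = self.lv_adapter(h[:, -1])                  # (B, dec_dim)
        masks = self.mask_decoder(query)                   # (B, 224*224)
        return masks.view(-1, 224, 224)                    # mask logits

# Toy usage with random inputs, just to show the tensor shapes line up.
model = VideoGLaMMSketch()
out = model(torch.randn(2, 8, 3 * 224 * 224), torch.randn(2, 16, 4096))
print(out.shape)  # torch.Size([2, 224, 224])
```

The key design point this sketch highlights is that only the V-L and L-V adapters need to be lightweight and tunable; the heavy vision and language backbones can remain largely fixed while the adapters learn the vision-language alignment.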