Large Language Models (LLMs) demonstrate remarkable proficiency in comprehending and handling text-based tasks. Many efforts aim to transfer these capabilities to the video modality; the resulting models are termed Video-LLMs. However, existing Video-LLMs capture only coarse-grained semantics and cannot effectively handle tasks that require comprehending or localizing specific video segments. To address these limitations, we propose Momentor, a Video-LLM capable of fine-grained temporal understanding. To support the training of Momentor, we design an automatic data generation engine and use it to construct Moment-10M, a large-scale video instruction dataset with segment-level instruction data. Training Momentor on Moment-10M enables it to perform segment-level reasoning and localization. Zero-shot evaluations on several tasks demonstrate that Momentor excels in fine-grained, temporally grounded comprehension and localization.