Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision. We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. This work also contributes multiple datasets augmented from existing ones with LLM-generated temporally grounded QA pairs for tasks requiring temporal supervision. Comprehensive experiments on temporal grounding and video QA benchmarks demonstrate that DaMO consistently surpasses prior methods, particularly in tasks demanding precise temporal alignment and reasoning. Our work establishes a promising direction for data-efficient video-language modeling.