Long videos, ranging from minutes to hours, pose significant challenges for current Multi-modal Large Language Models (MLLMs) due to their complex events, diverse scenes, and long-range dependencies. Directly encoding such videos is prohibitively expensive, while naive video-to-text conversion often yields redundant or fragmented content. To address these limitations, we introduce MMViR, a novel multi-modal, multi-grained structured representation for long video understanding. MMViR identifies key turning points to segment the video and constructs a three-level description that couples global narratives with fine-grained visual details. This design supports efficient query-based retrieval and generalizes well across various scenarios. Extensive evaluations on three tasks, namely question answering (QA), summarization, and retrieval, show that MMViR outperforms the strongest prior method, achieving a 19.67% improvement on hour-long video understanding while reducing processing latency to 45.4% of that baseline's.
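To make the multi-grained structure concrete, the sketch below shows one plausible way to organize such a representation: turning points delimit segments, each segment carries a mid-level event summary plus fine-grained frame details, and a global narrative sits on top. The class and field names (`Segment`, `VideoRepresentation`, `retrieve`, etc.) and the toy word-overlap ranking are illustrative assumptions, not the paper's actual schema or retrieval algorithm.

```python
from dataclasses import dataclass, field

# Hypothetical sketch in the spirit of MMViR's three-level representation.
# The abstract does not specify the schema; all names below are assumed.

@dataclass
class Segment:
    """A video span between two detected turning points."""
    start_sec: float
    end_sec: float
    event_summary: str  # mid level: what happens in this segment
    frame_details: list[str] = field(default_factory=list)  # fine-grained visual details

@dataclass
class VideoRepresentation:
    global_narrative: str  # top level: whole-video storyline
    segments: list[Segment] = field(default_factory=list)

    def retrieve(self, query: str, top_k: int = 3) -> list[Segment]:
        """Toy query-based retrieval: rank segments by word overlap with the query."""
        q = set(query.lower().split())

        def score(seg: Segment) -> int:
            text = seg.event_summary + " " + " ".join(seg.frame_details)
            return len(q & set(text.lower().split()))

        return sorted(self.segments, key=score, reverse=True)[:top_k]

if __name__ == "__main__":
    rep = VideoRepresentation(
        global_narrative="A cooking tutorial that moves from prep to plating.",
        segments=[
            Segment(0, 120, "Chef chops vegetables", ["close-up of knife", "cutting board"]),
            Segment(120, 300, "Chef plates the dish", ["white plate", "garnish with herbs"]),
        ],
    )
    for seg in rep.retrieve("how is the dish plated"):
        print(seg.event_summary)
```

Because each segment is summarized independently of the others, a query only needs to touch the text of the levels it cares about, which is consistent with the abstract's claim that the structure supports efficient query-based retrieval without re-encoding the raw video.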