QMAVIS: Long Video-Audio Understanding using Fusion of Large Multimodal Models

Large Multimodal Models (LMMs) for video-audio understanding have traditionally been evaluated only on short videos a few minutes long. In this paper, we introduce QMAVIS (Q Team-Multimodal Audio Video Intelligent Sensemaking), a novel long video-audio understanding pipeline built through a late fusion of LMMs, Large Language Models, and speech recognition models. QMAVIS addresses the gap in long-form video analytics, particularly for videos ranging from a few minutes to more than an hour long, opening up potential applications in sensemaking, video content analysis, embodied AI, and related areas. In quantitative experiments, QMAVIS demonstrated a 38.75% improvement over state-of-the-art video-audio LMMs such as VideoLlaMA2 and InternVL2 on the VideoMME (with subtitles) dataset, which comprises long videos with audio information. Evaluations on other challenging video understanding datasets, such as PerceptionTest and EgoSchema, showed improvements of up to 2%, indicating competitive performance. Qualitative experiments also showed that QMAVIS can extract the nuances of different scenes in long video-audio content while understanding the overarching narrative. Ablation studies were conducted to ascertain the impact of each component in the fusion pipeline.
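The late-fusion idea described above can be illustrated with a minimal sketch. The abstract does not specify how QMAVIS segments videos, which models it calls, or how it prompts them, so everything here is an assumption made for illustration: the fixed-length windowing, the `Segment` type, and the three callables `describe_segment` (standing in for an LMM), `transcribe_segment` (standing in for a speech recognition model), and `answer_from_text` (standing in for an LLM) are all hypothetical names, not part of QMAVIS.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """A time window of the video, in seconds (hypothetical helper type)."""
    start_s: float
    end_s: float


def split_into_segments(duration_s: float, window_s: float) -> List[Segment]:
    """Cut a long video into consecutive fixed-length windows.

    Fixed-length windowing is an assumption for this sketch; the abstract
    does not say how the video is segmented.
    """
    if duration_s <= 0 or window_s <= 0:
        raise ValueError("duration_s and window_s must be positive")
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        segments.append(Segment(start, end))
        start = end
    return segments


def late_fusion_answer(
    duration_s: float,
    window_s: float,
    question: str,
    describe_segment: Callable[[Segment], str],    # stand-in for an LMM
    transcribe_segment: Callable[[Segment], str],  # stand-in for an ASR model
    answer_from_text: Callable[[str, str], str],   # stand-in for an LLM
) -> str:
    """Run each modality separately per segment, then fuse the text outputs."""
    notes = []
    for seg in split_into_segments(duration_s, window_s):
        visual = describe_segment(seg)
        speech = transcribe_segment(seg)
        notes.append(
            f"[{seg.start_s:.0f}-{seg.end_s:.0f}s] visual: {visual} | speech: {speech}"
        )
    # Late fusion: the models never share features, only their text outputs.
    context = "\n".join(notes)
    return answer_from_text(context, question)
```

In this sketch the fusion happens purely at the text level, which is one common reading of "late fusion"; swapping any of the three callables for a real model would leave the control flow unchanged.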
