Large Multimodal Models (LMMs) for video-audio understanding have traditionally been evaluated only on short videos a few minutes long. In this paper, we introduce QMAVIS (Q Team-Multimodal Audio Video Intelligent Sensemaking), a novel long video-audio understanding pipeline built through a late fusion of LMMs, Large Language Models, and speech recognition models. QMAVIS addresses the gap in long-form video analytics, particularly for videos ranging from a few minutes to beyond an hour, opening up new potential applications in sensemaking, video content analysis, embodied AI, and more. Quantitative experiments show that QMAVIS achieves a 38.75% improvement over state-of-the-art video-audio LMMs such as VideoLlaMA2 and InternVL2 on the VideoMME (with subtitles) dataset, which comprises long videos with audio information. Evaluations on other challenging video understanding datasets such as PerceptionTest and EgoSchema show improvements of up to 2%, indicating competitive performance. Qualitative experiments further demonstrate that QMAVIS extracts the nuances of different scenes in long video-audio content while understanding the overarching narrative. Ablation studies were also conducted to ascertain the impact of each component in the fusion pipeline.