Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study of what effectively drives video understanding in LMMs. We begin by critically examining the primary contributors to the high computational requirements of video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) transfer effectively to larger models. Leveraging these insights, we explore many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, and training schedules. For example, we demonstrate that fps sampling during training is vastly preferable to uniform frame sampling, and we identify which vision encoders are best suited for video representation. Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieves superior performance across different model sizes. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models with an impressive 55.1 on LongVideoBench. Apollo-7B is state-of-the-art among 7B LMMs, scoring 70.9 on MLVU and 63.3 on Video-MME.