Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model

Video understanding models often struggle with high computational requirements, large parameter counts, and slow inference, making them inefficient for practical use. To tackle these challenges, we propose Mobile-VideoGPT, an efficient multimodal framework designed to operate with fewer than a billion parameters. Unlike traditional video large multimodal models (LMMs), Mobile-VideoGPT consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM), enabling real-time throughput. To further improve efficiency, we present an Attention-Based Frame Scoring mechanism to select key frames, along with an efficient token projector that prunes redundant visual tokens while preserving essential contextual cues. We evaluate our model across six well-established video understanding benchmarks (e.g., MVBench, EgoSchema, NextQA, and PercepTest). Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second while outperforming existing state-of-the-art 0.5B-parameter models by 6 points on average, with 40% fewer parameters and more than 2x higher throughput. Our code and models are publicly available at: https://github.com/Amshaker/Mobile-VideoGPT.
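The abstract does not spell out how the Attention-Based Frame Scoring mechanism computes its scores, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each frame is summarized by a single feature vector, scores a frame by the average self-attention it receives from all frames, and keeps the top-k frames in temporal order. The names `frame_scores` and `select_key_frames` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def frame_scores(frame_features):
    """Score each frame by the mean attention it receives (assumed scheme).

    frame_features: list of per-frame feature vectors (lists of floats).
    Returns one score per frame; the scores sum to 1.
    """
    n = len(frame_features)
    scale = math.sqrt(len(frame_features[0]))
    # Scaled dot-product self-attention, using the features as queries and keys.
    attn = []
    for q in frame_features:
        logits = [sum(a * b for a, b in zip(q, k)) / scale for k in frame_features]
        attn.append(softmax(logits))
    # Column mean: how much attention frame j receives across all queries.
    return [sum(attn[i][j] for i in range(n)) / n for j in range(n)]


def select_key_frames(frame_features, k):
    """Return the indices of the k highest-scoring frames, in temporal order."""
    scores = frame_scores(frame_features)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Toy input: frames 1 and 3 carry strong features, frames 0 and 2 are empty.
frames = [[0.0, 0.0], [3.0, 3.0], [0.0, 0.0], [2.0, 2.0]]
key_frames = select_key_frames(frames, k=2)
```

Sorting the selected indices back into temporal order matters because the downstream language model still has to reason about the sequence of events, not just the most salient frames.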

