Video understanding models often struggle with high computational requirements, extensive parameter counts, and slow inference speed, making them inefficient for practical use. To tackle these challenges, we propose Mobile-VideoGPT, an efficient multimodal framework designed to operate with fewer than a billion parameters. Unlike traditional video large multimodal models (LMMs), Mobile-VideoGPT consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM), enabling real-time throughput. To further improve efficiency, we present an Attention-Based Frame Scoring mechanism that selects key frames, along with an efficient token projector that prunes redundant visual tokens while preserving essential contextual cues. We evaluate our model on six well-established video understanding benchmarks (e.g., MVBench, EgoSchema, NextQA, and PercepTest). Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second while outperforming existing state-of-the-art 0.5B-parameter models by 6 points on average, with 40% fewer parameters and more than 2x higher throughput. Our code and models are publicly available at: https://github.com/Amshaker/Mobile-VideoGPT.
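The abstract does not specify how the Attention-Based Frame Scoring mechanism computes its scores, so the following is only a minimal sketch of one plausible reading, not the released implementation. It assumes each frame is represented by a single pooled feature vector, computes scaled dot-product self-attention across frames, scores each frame by the average attention it receives, and keeps the top-k frames in temporal order. The names `frame_scores` and `select_key_frames`, the pooling assumption, and the choice of `k` are all hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def frame_scores(frame_feats):
    """Score each frame by the mean attention it receives from all frames.

    frame_feats: list of per-frame feature vectors (assumed already pooled,
    one vector per frame; this is an illustrative assumption).
    """
    n = len(frame_feats)
    d = len(frame_feats[0])
    scale = 1.0 / math.sqrt(d)
    # Row i is frame i's attention distribution over all frames.
    attn = [
        softmax([
            scale * sum(q * k for q, k in zip(frame_feats[i], frame_feats[j]))
            for j in range(n)
        ])
        for i in range(n)
    ]
    # Column mean: how much attention frame j receives on average.
    return [sum(attn[i][j] for i in range(n)) / n for j in range(n)]


def select_key_frames(frame_feats, k):
    """Return indices of the k highest-scoring frames, in temporal order."""
    scores = frame_scores(frame_feats)
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)


if __name__ == "__main__":
    # Toy features for four frames (hypothetical values).
    feats = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 0.5]]
    print(select_key_frames(feats, k=2))
```

Because every attention row sums to one, the scores form a distribution over frames, which makes them easy to compare across clips of different lengths. Returning the selected indices in temporal order keeps the frame sequence consistent for any downstream encoder.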