With the growing scale and complexity of video data, efficiently processing long video sequences poses significant challenges due to the quadratic growth in memory and computational demands of existing transformer-based Large Multi-modal Models (LMMs). To address these issues, we introduce Video-Ma$^2$mba, a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework, replacing the attention mechanisms. This allows LMMs to scale linearly in both time and memory, making it feasible to handle long-duration video content. Furthermore, we improve memory efficiency by introducing the Multi-Axis Gradient Checkpointing (MA-GC) method, which strategically manages memory by retaining only essential activations across multiple computational axes. Our approach significantly reduces the memory footprint compared to standard gradient checkpointing. Empirical analyses show that Video-Ma$^2$mba can process extensive video sequences, equivalent to millions of tokens or over two hours of continuous video at 1 FPS, on a single GPU. By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks, demonstrating substantial advantages over existing frameworks.
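To make the multi-axis idea concrete, here is a minimal PyTorch sketch of checkpointing along two axes at once: layer depth and sequence length. All names (`TinySSMBlock`, `multi_axis_checkpoint_forward`, `seg_len`) are illustrative assumptions, not the paper's MA-GC implementation, which we describe in the method section.

```python
# Hypothetical sketch: gradient checkpointing across two axes (layer
# depth and sequence length). Not the paper's MA-GC code.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TinySSMBlock(nn.Module):
    """Stand-in for a Mamba-2 block; any per-token seq-to-seq module works."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, seq, dim)
        return x + torch.tanh(self.proj(x))

def multi_axis_checkpoint_forward(blocks, x, seg_len=1024):
    # Axis 1 (layers): each block's internal activations are recomputed
    # during backward instead of stored, keeping only layer boundaries.
    # Axis 2 (sequence): the sequence is processed in segments, so each
    # recomputation touches at most seg_len tokens at a time.
    # Caveat: segmenting is only exact for modules whose per-token output
    # depends on the segment alone (or whose recurrent state is threaded
    # through across segments; omitted here for brevity).
    for block in blocks:
        segments = []
        for start in range(0, x.size(1), seg_len):
            chunk = x[:, start:start + seg_len]
            segments.append(checkpoint(block, chunk, use_reentrant=False))
        x = torch.cat(segments, dim=1)
    return x

blocks = nn.ModuleList(TinySSMBlock(64) for _ in range(4))
x = torch.randn(1, 4096, 64, requires_grad=True)
out = multi_axis_checkpoint_forward(blocks, x, seg_len=1024)
out.sum().backward()  # activations recomputed block-by-block, segment-by-segment
```

Checkpointing along the layer axis alone bounds stored activations by the number of layers; adding the sequence axis additionally bounds them by the segment length, which is what makes million-token inputs tractable on a single GPU.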