With the growing scale and complexity of video data, efficiently processing long video sequences poses significant challenges due to the quadratic growth in memory and computational demands of existing transformer-based Large Multi-modal Models (LMMs). To address these issues, we introduce Video-Ma$^2$mba, a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework, replacing the attention mechanisms. This allows LMMs to scale linearly in both time and memory, making it feasible to handle long-duration video content. Furthermore, we improve memory efficiency by introducing the Multi-Axis Gradient Checkpointing (MA-GC) method, which strategically manages memory by retaining only essential activations across multiple computational axes. Our approach significantly reduces the memory footprint compared to standard gradient checkpointing. Empirical analyses show that Video-Ma$^2$mba can process extensive video sequences, equivalent to millions of tokens or over two hours of continuous video at 1 FPS, on a single GPU. By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks, demonstrating substantial advantages over existing frameworks.
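To make the multi-axis idea concrete, here is a minimal PyTorch sketch of checkpointing along two axes at once: layer depth and sequence length. All names (`TinySSMBlock`, `multi_axis_checkpoint_forward`, `seg_len`) are illustrative assumptions, not the paper's MA-GC implementation, which we describe in the method section.

```python
# Hypothetical sketch: gradient checkpointing across two axes (layer
# depth and sequence length). Not the paper's MA-GC code.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TinySSMBlock(nn.Module):
    """Stand-in for a Mamba-2 block; any per-token seq-to-seq module works."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, seq, dim)
        return x + torch.tanh(self.proj(x))

def multi_axis_checkpoint_forward(blocks, x, seg_len=1024):
    # Axis 1 (layers): each block's internal activations are recomputed
    # during backward instead of stored, keeping only layer boundaries.
    # Axis 2 (sequence): the sequence is processed in segments, so each
    # recomputation touches at most seg_len tokens at a time.
    # Caveat: segmenting is only exact for modules whose per-token output
    # depends on the segment alone (or whose recurrent state is threaded
    # through across segments; omitted here for brevity).
    for block in blocks:
        segments = []
        for start in range(0, x.size(1), seg_len):
            chunk = x[:, start:start + seg_len]
            segments.append(checkpoint(block, chunk, use_reentrant=False))
        x = torch.cat(segments, dim=1)
    return x

blocks = nn.ModuleList(TinySSMBlock(64) for _ in range(4))
x = torch.randn(1, 4096, 64, requires_grad=True)
out = multi_axis_checkpoint_forward(blocks, x, seg_len=1024)
out.sum().backward()  # activations recomputed block-by-block, segment-by-segment
```

Checkpointing along the layer axis alone bounds stored activations by the number of layers; adding the sequence axis additionally bounds them by the segment length, which is what makes million-token inputs tractable on a single GPU.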