State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs due to the quadratic complexity of the causal self-attention operations, leading to high computational costs during training and inference. Existing token compression-based methods reduce the number of video tokens but often incur information loss and remain inefficient for extremely long sequences. In this paper, we explore an orthogonal direction and build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. Without any token reduction, VAMBA can encode more than 1024 frames (640$\times$360) on a single GPU, while transformer-based models can only encode 256 frames. On long video inputs, VAMBA achieves at least a 50% reduction in GPU memory usage during training and inference, and nearly doubles the speed per training step compared to transformer-based LMMs. Our experimental results demonstrate that VAMBA improves accuracy by 4.3% on the challenging hour-long video understanding benchmark LVBench over prior efficient video LMMs, and maintains strong performance across a broad spectrum of long and short video understanding tasks.
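The efficiency gap above comes down to how each block mixes tokens. A minimal sketch (hypothetical, not the VAMBA implementation): a toy recurrent scan in the spirit of an SSM block touches each token once, so cost grows linearly with sequence length, while causal self-attention must revisit every previous token at each position, giving quadratic cost. The function names and the scalar `decay` parameter are illustrative assumptions only.

```python
# Toy contrast of linear-time recurrent encoding vs quadratic causal attention.
# This is an illustrative sketch, NOT the actual Mamba-2 or VAMBA computation:
# real SSM blocks use learned, input-dependent state transitions over vectors.

def recurrent_encode(tokens, decay=0.9):
    """O(n): carry a single running state; each token is visited once."""
    state = 0.0
    out = []
    for t in tokens:
        state = decay * state + (1.0 - decay) * t  # exponential moving average
        out.append(state)
    return out

def causal_attention_encode(tokens):
    """O(n^2): each position re-reads its full prefix (uniform weights here)."""
    out = []
    for i in range(len(tokens)):
        prefix = tokens[: i + 1]
        out.append(sum(prefix) / len(prefix))
    return out

tokens = [1.0, 2.0, 3.0, 4.0]
print(recurrent_encode(tokens))
print(causal_attention_encode(tokens))
```

Doubling the number of video tokens roughly doubles the work in the first function but quadruples it in the second, which is why the recurrent-style path scales to hour-long frame sequences.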