Video understanding requires the extraction of rich spatio-temporal representations, which transformer models achieve through self-attention. Unfortunately, self-attention is computationally expensive, since its cost grows quadratically with the number of tokens. In NLP, Mamba has emerged as an efficient alternative to transformers. However, Mamba's successes do not trivially extend to computer vision tasks, including those in video analysis. In this paper, we theoretically analyze the differences between self-attention and Mamba. We identify two limitations in Mamba's token processing: historical decay and element contradiction. We propose VideoMambaPro (VMP), which addresses these limitations by adding masked backward computation and elemental residual connections to a VideoMamba backbone. VideoMambaPro achieves state-of-the-art video action recognition performance compared to transformer models and surpasses VideoMamba by clear margins: 7.9% and 8.1% top-1 accuracy on Kinetics-400 and Something-Something V2, respectively. Our VideoMambaPro-M model achieves 91.9% top-1 accuracy on Kinetics-400, only 0.2% below InternVideo2-6B, while using only 1.2% of its parameters. This combination of high performance and efficiency makes VideoMambaPro a compelling alternative to transformer models.