Video understanding requires the extraction of rich spatio-temporal representations, which transformer models achieve through self-attention. Unfortunately, self-attention poses a heavy computational burden. In NLP, Mamba has emerged as an efficient alternative to transformers. However, Mamba's successes do not trivially extend to computer vision tasks, including those in video analysis. In this paper, we theoretically analyze the differences between self-attention and Mamba. We identify two limitations in Mamba's token processing: historical decay and element contradiction. We propose VideoMambaPro (VMP), which addresses the identified limitations by adding masked backward computation and elemental residual connections to a VideoMamba backbone. VideoMambaPro shows state-of-the-art video action recognition performance compared to transformer models, and surpasses VideoMamba by clear margins: 7.9% and 8.1% top-1 on Kinetics-400 and Something-Something V2, respectively. Our VideoMambaPro-M model achieves 91.9% top-1 on Kinetics-400, only 0.2% below InternVideo2-6B but with only 1.2% of its parameters. The combination of high performance and efficiency makes VideoMambaPro an interesting alternative to transformer models.
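To make the two proposed mechanisms concrete, the toy sketch below illustrates one plausible reading of them on a scalar linear recurrence (a stand-in for Mamba's selective scan). The recurrence, the decay constant `a`, and the functions `scan` and `bidirectional_masked` are illustrative assumptions, not the paper's actual implementation: the "masked backward computation" is modeled as removing each token's own (diagonal) contribution from the backward scan so it is not counted in both directions, and the "elemental residual connection" as re-adding the input element once.

```python
import numpy as np

def scan(x, a=0.9):
    # Toy linear recurrence h_t = a * h_{t-1} + x_t, standing in for an SSM scan.
    h = np.zeros_like(x, dtype=float)
    acc = 0.0
    for t in range(len(x)):
        acc = a * acc + x[t]
        h[t] = acc
    return h

def bidirectional_masked(x, a=0.9):
    # Forward scan over the sequence.
    fwd = scan(x, a)
    # Backward scan: scan the reversed sequence, then reverse the result back.
    # Subtracting x masks each token's own contribution in the backward pass
    # (a hypothetical reading of "masked backward computation"), so the token
    # is not double-counted across the two directions.
    bwd = scan(x[::-1], a)[::-1] - x
    # Elemental residual connection: re-add the input element once.
    return fwd + bwd + x
```

With `a = 0.5` and input `[1, 2, 3]`, the forward scan gives `[1, 2.5, 4.25]` and the masked backward scan `[1.75, 1.5, 0]`; adding the residual yields `[3.75, 6.0, 7.25]`, with each token contributing to its own output exactly twice (once per direction's residual path) rather than three times as in an unmasked bidirectional sum.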