We introduce VideoMamba, a novel adaptation of the pure Mamba architecture specifically designed for video recognition. Unlike transformers, whose reliance on self-attention incurs high computational costs from quadratic complexity, VideoMamba leverages Mamba's linear complexity and selective SSM mechanism for more efficient processing. The proposed Spatio-Temporal Forward and Backward SSM allows the model to effectively capture the complex relationship between non-sequential spatial and sequential temporal information in video. Consequently, VideoMamba is not only resource-efficient but also effective at capturing long-range dependencies in videos, as demonstrated by competitive performance and outstanding efficiency on a variety of video understanding benchmarks. Our work highlights the potential of VideoMamba as a powerful tool for video understanding, offering a simple yet effective baseline for future research in video analysis.
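To make the idea of bidirectional spatio-temporal scanning concrete, here is a minimal NumPy sketch. It uses a toy, non-selective linear SSM (in the actual Mamba architecture the state-space parameters are input-dependent), flattens the video's spatial and temporal tokens into one sequence, and fuses a forward scan with a backward scan. All names and shapes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Toy linear SSM recurrence over a token sequence:
    h_t = A h_{t-1} + B x_t,  y_t = C h_t."""
    L, d = x.shape
    n = A.shape[0]
    h = np.zeros(n)
    y = np.empty((L, d))
    for t in range(L):
        h = A @ h + B @ x[t]
        y[t] = C @ h
    return y

def st_forward_backward(video_tokens, A, B, C):
    """Flatten (T, S, d) spatio-temporal tokens into one sequence and
    fuse a forward scan with a backward (time-reversed) scan, sketching
    the forward-and-backward SSM idea."""
    T, S, d = video_tokens.shape
    seq = video_tokens.reshape(T * S, d)        # temporal-major token order
    fwd = ssm_scan(seq, A, B, C)                # forward scan
    bwd = ssm_scan(seq[::-1], A, B, C)[::-1]    # backward scan, realigned
    return (fwd + bwd).reshape(T, S, d)

# Tiny usage example with random tokens and a stable state transition.
rng = np.random.default_rng(0)
T, S, d, n = 2, 4, 3, 5                          # frames, tokens/frame, dim, state size
tokens = rng.standard_normal((T, S, d))
A = 0.9 * np.eye(n)                              # stable (contracting) transition
B = 0.1 * rng.standard_normal((n, d))
C = 0.1 * rng.standard_normal((d, n))
out = st_forward_backward(tokens, A, B, C)
print(out.shape)  # (2, 4, 3)
```

The backward scan lets every token aggregate state from tokens that come later in the flattened order, which is one plausible way to handle the non-sequential nature of spatial positions; the real model's selective scan and scan ordering may differ.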