Understanding videos is one of the fundamental directions in computer vision research, with extensive efforts dedicated to exploring architectures such as RNNs, 3D CNNs, and Transformers. The recently proposed state space model architecture, e.g., Mamba, shows promise in extending its success in long-sequence modeling to video modeling. To assess whether Mamba can be a viable alternative to Transformers in the video understanding domain, we conduct a comprehensive set of studies in this work, probing the different roles Mamba can play in modeling videos and investigating the diverse tasks in which Mamba may exhibit superiority. We categorize Mamba into four roles for modeling videos, derive a Video Mamba Suite composed of 14 models/modules, and evaluate them on 12 video understanding tasks. Our extensive experiments reveal the strong potential of Mamba on both video-only and video-language tasks and show promising efficiency-performance trade-offs. We hope this work provides valuable data points and insights for future research on video understanding. Code is publicly available at https://github.com/OpenGVLab/video-mamba-suite.