Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have long been the predominant backbone networks for visual representation learning. While ViTs have recently gained prominence over CNNs thanks to their superior fitting capabilities, their scalability is largely constrained by the quadratic complexity of attention computation. Inspired by Mamba's ability to model long sequences efficiently, we propose VMamba, a generic vision backbone that aims to reduce computational complexity to linear while retaining the advantageous features of ViTs. To adapt VMamba to vision data, we introduce the Cross-Scan Module (CSM), which enables 1D selective scanning over the 2D image space with global receptive fields. We further refine the implementation details and architectural design to improve VMamba's performance and inference speed. Extensive experiments demonstrate VMamba's promising performance across a range of visual perception tasks and highlight its pronounced advantage in input-scaling efficiency over existing benchmark models. Source code is available at https://github.com/MzeroMiko/VMamba.
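To make the cross-scan idea concrete, below is a minimal illustrative sketch, not the repository's implementation: it assumes the four traversal routes commonly described for CSM (row-major, column-major, and their reverses), omits the per-route selective-scan (S6) computation (treated as identity here), and follows the (B, C, H, W) tensor convention. The helper names `cross_scan` and `cross_merge` are hypothetical and chosen only for illustration.

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Unfold a 2D feature map (B, C, H, W) into four 1D sequences (B, 4, C, H*W),
    one per scanning route, so each can be processed by a 1D sequence model."""
    row_wise = x.flatten(2)                          # left-to-right, top-to-bottom
    col_wise = x.transpose(2, 3).flatten(2)          # top-to-bottom, left-to-right
    seqs = torch.stack([row_wise, col_wise], dim=1)  # (B, 2, C, L)
    return torch.cat([seqs, seqs.flip(-1)], dim=1)   # add the reversed routes -> (B, 4, C, L)

def cross_merge(seqs: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Invert the four scanning routes and sum them back into a (B, C, H, W) map."""
    B, K, C, L = seqs.shape
    fwd = seqs[:, :2] + seqs[:, 2:].flip(-1)          # undo the reversed routes -> (B, 2, C, L)
    row = fwd[:, 0].view(B, C, H, W)                  # undo the row-major flatten
    col = fwd[:, 1].view(B, C, W, H).transpose(2, 3)  # undo the column-major flatten
    return row + col

# Usage: with the per-route sequence model left as identity, merging simply
# sums the four reconstructed copies of the input.
x = torch.randn(2, 96, 14, 14)
seqs = cross_scan(x)            # (2, 4, 96, 196): per-route 1D sequences
y = cross_merge(seqs, 14, 14)   # (2, 96, 14, 14)
assert torch.allclose(y, 4 * x)
```

In the actual model, each of the four sequences would pass through a selective-scan block before merging, which is what gives every spatial location a global receptive field at linear cost.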