Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) stand as the two most popular foundation models for visual representation learning. While CNNs scale well, with complexity linear in image resolution, ViTs surpass them in fitting capability despite their quadratic complexity. A closer inspection reveals that ViTs achieve superior visual modeling performance through global receptive fields and dynamic weights. This observation motivates us to propose a novel architecture that inherits these components while improving computational efficiency. To this end, we draw inspiration from the recently introduced state space model and propose the Visual State Space Model (VMamba), which achieves linear complexity without sacrificing global receptive fields. To address the direction-sensitivity issue that arises, we introduce the Cross-Scan Module (CSM), which traverses the spatial domain and converts any non-causal visual image into ordered patch sequences. Extensive experimental results show that VMamba not only demonstrates promising capabilities across various visual perception tasks, but also exhibits more pronounced advantages over established baselines as the image resolution increases. Source code is available at https://github.com/MzeroMiko/VMamba.
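To make the cross-scan idea concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes four scan routes (row-major, column-major, and the reverse of each), which the abstract itself does not spell out. The function names `cross_scan` and `cross_merge` are hypothetical and chosen for illustration only. The sequence model that would process each route between the two steps is omitted.

```python
from typing import List, Sequence


def cross_scan(grid: Sequence[Sequence[float]]) -> List[List[float]]:
    """Flatten an H x W patch grid into four 1-D sequences.

    Assumption: the four routes are row-major, column-major,
    and the reverse of each.
    """
    rows = [list(r) for r in grid]
    # Route 1: left-to-right within a row, rows taken top-to-bottom.
    row_major = [p for r in rows for p in r]
    # Route 2: top-to-bottom within a column, columns taken left-to-right.
    col_major = [p for col in zip(*rows) for p in col]
    # Routes 3 and 4: the same two traversals read backwards.
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def cross_merge(
    seqs: Sequence[Sequence[float]], height: int, width: int
) -> List[List[float]]:
    """Map each route back to the grid and sum the four values per patch."""
    row_major, col_major, row_rev, col_rev = seqs
    n = height * width
    out = [[0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            r = i * width + j   # index of patch (i, j) in the row-major route
            c = j * height + i  # index of patch (i, j) in the column-major route
            out[i][j] = (
                row_major[r]
                + col_major[c]
                + row_rev[n - 1 - r]
                + col_rev[n - 1 - c]
            )
    return out


if __name__ == "__main__":
    patches = [[0, 1, 2], [3, 4, 5]]  # a 2 x 3 grid of patch values
    routes = cross_scan(patches)
    for route in routes:
        print(route)
    # With no per-route processing, every patch receives four copies of itself.
    print(cross_merge(routes, height=2, width=3))
```

Each patch appears once in every route, so a causal sequence model run over all four routes would let every patch draw on context from every other patch. The cost stays linear in the number of patches, since there are a fixed number of routes and each has length H x W.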