Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose a dynamic method to independently search for the optimal scan choices for each layer, substantially improving performance. Extensive experiments across both plain and hierarchical models underscore our approach's superiority in effectively capturing image representations. For example, our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs. Code is available at: https://github.com/hunto/LocalMamba.
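To make the locality argument concrete, the following is a minimal sketch of a windowed scan order compared with plain row-major flattening. It is an illustration under stated assumptions, not the implementation from the linked repository: the function names (`row_major_scan`, `local_scan`, `sequence_gap`), the 4x4 token grid, and the 2x2 window size are all hypothetical choices made for this example.

```python
def row_major_scan(height, width):
    """Token indices in plain flattening order (row by row)."""
    return list(range(height * width))


def local_scan(height, width, window):
    """Token indices visited window by window.

    Windows are visited in row-major order, and tokens inside each
    window are also visited in row-major order. This is an assumed,
    simplified scan pattern chosen for illustration.
    """
    if height % window or width % window:
        raise ValueError("height and width must be multiples of window")
    order = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            for row in range(top, top + window):
                for col in range(left, left + window):
                    order.append(row * width + col)
    return order


def sequence_gap(order, token_a, token_b):
    """Distance between two tokens in the 1D sequence defined by `order`."""
    return abs(order.index(token_a) - order.index(token_b))


if __name__ == "__main__":
    height, width, window = 4, 4, 2  # hypothetical grid and window size
    flat = row_major_scan(height, width)
    local = local_scan(height, width, window)
    print("row-major:", flat)
    print("local    :", local)
    # Token 4 sits directly below token 0 in the 2D grid.
    print("gap (row-major):", sequence_gap(flat, 0, 4))
    print("gap (local)    :", sequence_gap(local, 0, 4))
```

In this toy grid, vertically adjacent tokens inside the same window end up `window` positions apart in the sequence instead of `width` positions apart, which is the effect the abstract attributes to local scanning. Tokens that straddle a window boundary do not get this benefit, which is one reason a global scan may still be needed alongside the local one.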