Mamba, a State Space Model (SSM), has recently shown performance competitive with Convolutional Neural Networks (CNNs) and Transformers in Natural Language Processing and general sequence modeling. Various attempts have been made to adapt Mamba to Computer Vision tasks, including medical image segmentation (MIS). Vision Mamba (VM)-based networks are particularly attractive due to their ability to achieve global receptive fields, similar to Vision Transformers, while maintaining linear complexity in the number of tokens. However, existing VM models still struggle to maintain both spatially local and global dependencies of tokens in high-dimensional arrays due to their sequential nature. Employing multiple and/or complicated scanning strategies is computationally costly, which hinders the application of SSMs to the high-dimensional 2D and 3D images common in MIS problems. In this work, we propose Local-Global Vision Mamba, LoG-VMamba, which explicitly enforces that spatially adjacent tokens remain nearby on the channel axis, and retains the global context in a compressed form. Our method allows the SSMs to access local and global contexts even before reaching the last token, while requiring only a simple scanning strategy. Our segmentation models are computationally efficient and substantially outperform both CNN- and Transformer-based baselines on a diverse set of 2D and 3D MIS tasks. The implementation of LoG-VMamba is available at \url{https://github.com/Oulu-IMEDS/LoG-VMamba}.
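To illustrate the kind of channel-axis locality the abstract describes, the sketch below uses a simple space-to-depth rearrangement: each small spatial neighborhood is stacked onto the channel axis, so that after the map is flattened into a 1D token sequence, a token already carries its spatial neighbors. This is only an illustrative sketch of the general idea, not the paper's actual local-token extractor; the function name and block size are assumptions.

```python
import numpy as np

def space_to_depth(x, block=2):
    """Rearrange a (H, W, C) feature map so that each block x block spatial
    neighborhood is stacked along the channel axis. After flattening the
    (H/block, W/block) grid into a token sequence, spatially adjacent pixels
    sit inside the same token's channels instead of being far apart in the
    1D scan order. Illustrative sketch only, not LoG-VMamba's exact operator."""
    H, W, C = x.shape
    assert H % block == 0 and W % block == 0
    x = x.reshape(H // block, block, W // block, block, C)
    x = x.transpose(0, 2, 1, 3, 4)          # (H/b, W/b, b, b, C)
    return x.reshape(H // block, W // block, block * block * C)

# A 4x4 single-channel map: the first output token carries pixels
# (0,0), (0,1), (1,0), (1,1), i.e. a full 2x2 spatial neighborhood.
fm = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
out = space_to_depth(fm, block=2)
print(out.shape)   # (2, 2, 4)
print(out[0, 0])   # [0. 1. 4. 5.]
```

In a sequential scan of the original 4x4 map, pixel (1, 0) appears four steps after pixel (0, 0); after the rearrangement, both live in the same token, so an SSM sees them together without any extra scanning passes.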