In the realm of medical image segmentation, both CNN-based and Transformer-based models have been extensively explored. However, CNNs exhibit limitations in long-range modeling, whereas Transformers are hampered by their quadratic computational complexity. Recently, State Space Models (SSMs), exemplified by Mamba, have emerged as a promising alternative: they not only excel at modeling long-range interactions but also maintain linear computational complexity. In this paper, leveraging state space models, we propose a U-shaped architecture for medical image segmentation, named Vision Mamba UNet (VM-UNet). Specifically, the Visual State Space (VSS) block is introduced as the foundation block to capture extensive contextual information, and an asymmetric encoder-decoder structure is constructed with fewer convolution layers to reduce computational cost. We conduct comprehensive experiments on the ISIC17, ISIC18, and Synapse datasets, and the results indicate that VM-UNet performs competitively on medical image segmentation tasks. To the best of our knowledge, this is the first medical image segmentation model built on a pure SSM-based backbone. We aim to establish a baseline and provide valuable insights for the future development of more efficient and effective SSM-based segmentation systems. Our code is available at https://github.com/JCruan519/VM-UNet.
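The linear-complexity claim for SSMs comes from their recurrent form: the output at each step depends on a fixed-size hidden state, so one pass over the sequence suffices. As a minimal, hypothetical sketch (a scalar toy recurrence, not the authors' VSS block or Mamba's selective scan), consider:

```python
def ssm_scan(x, a=0.9, b=0.5, c=1.0):
    """Run h_t = a*h_{t-1} + b*x_t ; y_t = c*h_t over a 1-D sequence.

    A single pass over the sequence costs O(L) time, in contrast to
    the O(L^2) pairwise interactions of Transformer attention.
    The parameters a, b, c are illustrative toy values, not anything
    from VM-UNet.
    """
    h = 0.0
    ys = []
    for xt in x:
        h = a * h + b * xt   # state update: `a` controls long-range memory
        ys.append(c * h)     # readout from the hidden state
    return ys

# A unit impulse decays geometrically (0.5, 0.45, 0.405, ...):
# the state carries information about past inputs arbitrarily far forward.
print(ssm_scan([1.0, 0.0, 0.0, 0.0]))
```

Mamba extends this idea by making the transition parameters input-dependent (the "selective" scan) and applying it with vector-valued states, which is what the VSS block adapts to 2-D visual features.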