Transformer- and Mamba-based architectures have significantly advanced 3D medical image segmentation by enabling global contextual modeling, a capability in which Convolutional Neural Networks (CNNs) have traditionally been limited. However, state-of-the-art Transformer models often carry high computational complexity and large parameter counts. This cost is particularly prohibitive for volumetric data, and the limited availability of annotated medical imaging datasets compounds the problem. To address these limitations, this work introduces SegMaFormer, a lightweight hybrid architecture that combines Mamba and Transformer modules within a hierarchical volumetric encoder for efficient long-range dependency modeling. The model employs Mamba-based layers in the early, high-resolution stages, where they capture essential spatial context at reduced computational overhead, and reserves self-attention for the later, lower-resolution stages, where it refines the feature representation. This design is augmented with generalized rotary position embeddings to enhance spatial awareness. Despite its compact structure, SegMaFormer achieves competitive performance on three public benchmarks (Synapse, BraTS, and ACDC), matching the Dice scores of significantly larger models. Empirically, our approach reduces the parameter count by up to 75× and substantially decreases FLOPs compared to current state-of-the-art models, establishing an efficient and high-performing solution for 3D medical image segmentation.
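The reasoning behind placing linear-time layers at high resolution and self-attention at low resolution can be illustrated with a rough token-count calculation. The sketch below is not the SegMaFormer implementation: the input size (128³), the stem stride of 4, the four-stage layout, and the two-plus-two split between scan-based and attention stages are all assumptions chosen for illustration, and the cost model ignores channel width and constant factors. It only compares how token mixing scales, quadratically in the number of tokens for self-attention and linearly for a Mamba-style scan.

```python
# Illustrative cost model only. All sizes and the stage layout are
# hypothetical assumptions, not values taken from the paper.

def stage_token_counts(volume=(128, 128, 128), stem_stride=4, num_stages=4):
    """Number of tokens at each stage of a hierarchical 3D encoder,
    assuming a strided stem followed by 2x downsampling between stages."""
    d, h, w = (s // stem_stride for s in volume)
    counts = []
    for _ in range(num_stages):
        counts.append(d * h * w)
        d, h, w = max(d // 2, 1), max(h // 2, 1), max(w // 2, 1)
    return counts


def relative_mixing_cost(n_tokens, kind):
    """Token-mixing cost up to constant factors:
    N^2 for self-attention, N for a linear-time scan."""
    if kind == "attention":
        return n_tokens ** 2
    if kind == "scan":
        return n_tokens
    raise ValueError(f"unknown layer kind: {kind}")


def encoder_cost(counts, layout):
    """Sum the relative mixing cost over all stages for a given layout."""
    return sum(relative_mixing_cost(n, k) for n, k in zip(counts, layout))


counts = stage_token_counts()
all_attention = encoder_cost(counts, ["attention"] * 4)
# Assumed hybrid layout: scans in the two high-resolution stages,
# attention in the two low-resolution stages.
hybrid = encoder_cost(counts, ["scan", "scan", "attention", "attention"])

print("tokens per stage:", counts)
print("all-attention relative cost:", all_attention)
print("hybrid relative cost:", hybrid)
```

Under these assumptions the first stage alone holds tens of thousands of tokens, so its quadratic attention term dominates the all-attention total, while the last two stages are small enough that attention there is cheap. This is a scaling argument only and says nothing about the actual FLOPs or parameter counts reported for the model.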