Recent advances in Vision Transformers (ViTs) have significantly enhanced medical image segmentation by facilitating the learning of global relationships. However, these methods face a notable challenge in capturing diverse local and global long-range sequential feature representations, which is particularly evident in whole-body CT (WBCT) scans. To overcome this limitation, we introduce the Swin Soft Mixture Transformer (Swin SMT), a novel architecture based on Swin UNETR. This model incorporates a Soft Mixture-of-Experts (Soft MoE) to effectively handle complex and diverse long-range dependencies. The use of Soft MoE allows for scaling up model parameters while maintaining a balance between computational complexity and segmentation performance in both training and inference modes. We evaluate Swin SMT on the publicly available TotalSegmentator-V2 dataset, which includes 117 major anatomical structures in WBCT images. Comprehensive experimental results demonstrate that Swin SMT outperforms several state-of-the-art methods in 3D anatomical structure segmentation, achieving an average Dice Similarity Coefficient of 85.09%. The code and pre-trained weights of Swin SMT are publicly available at https://github.com/MI2DataLab/SwinSMT.
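To make the Soft MoE routing mentioned above concrete, the following is a minimal NumPy sketch of the soft mixture-of-experts mechanism (per-slot soft dispatch and combine of tokens), not the actual Swin SMT implementation; the expert functions, dimensions, and the `phi` parameter matrix here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe(X, phi, experts):
    """Soft MoE layer sketch.

    X:       (n, d) input tokens.
    phi:     (d, e*s) learnable slot parameters (e experts, s slots each).
    experts: list of e callables mapping (s, d) -> (s, d).
    """
    logits = X @ phi                      # (n, e*s) token-slot affinities
    D = softmax(logits, axis=0)           # dispatch weights: softmax over tokens
    slots = D.T @ X                       # (e*s, d): each slot is a convex combo of tokens
    e = len(experts)
    s = slots.shape[0] // e
    # Each expert processes only its own slots
    outs = np.concatenate(
        [experts[i](slots[i * s:(i + 1) * s]) for i in range(e)], axis=0
    )                                     # (e*s, d)
    C = softmax(logits, axis=1)           # combine weights: softmax over slots
    return C @ outs                       # (n, d): tokens re-assembled from slot outputs

# Illustrative usage with toy dimensions and dummy experts
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))           # 4 tokens of dimension 8
phi = rng.standard_normal((8, 6))         # 2 experts x 3 slots
experts = [lambda z: z * 2.0, lambda z: z + 1.0]
Y = soft_moe(X, phi, experts)             # same shape as X: (4, 8)
```

Because every token contributes to every slot with a soft weight, the layer is fully differentiable and avoids the token-dropping and load-balancing issues of hard top-k routing, which is what makes scaling parameters cheap at inference.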