Transformers have shown great success in medical image segmentation. However, they may exhibit limited generalization ability because the underlying self-attention (SA) mechanism operates at a single scale. In this paper, we address this issue by introducing a Multi-scale hiERarchical vIsion Transformer (MERIT) backbone network, which improves the generalizability of the model by computing SA at multiple scales. We also incorporate an attention-based decoder, namely Cascaded Attention Decoding (CASCADE), to further refine the multi-stage features generated by MERIT. Finally, we introduce an effective multi-stage feature mixing loss aggregation (MUTATION) method that improves model training via implicit ensembling. Our experiments on two widely used medical image segmentation benchmarks (i.e., Synapse Multi-organ and ACDC) demonstrate the superior performance of MERIT over state-of-the-art methods. Our MERIT architecture and MUTATION loss aggregation can be applied to downstream medical image and semantic segmentation tasks.
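The abstract does not spell out how the multi-stage loss aggregation is computed, so the following is only a minimal sketch of one plausible reading: the prediction maps from the decoder stages are mixed by element-wise summation over every non-empty subset of stages, a loss is computed for each mixed map, and the losses are summed. The function names (`aggregate_mixed_losses`, `mix`, `mse`), the summation-based mixing, and the mean-squared-error stand-in loss are all assumptions made for illustration and are not taken from the text above.

```python
from itertools import combinations


def mse(pred, target):
    """Stand-in loss (mean squared error); an assumption, not the paper's loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def mix(maps):
    """Mix several flattened prediction maps by element-wise summation (assumed)."""
    return [sum(vals) for vals in zip(*maps)]


def aggregate_mixed_losses(stage_preds, target, loss_fn=mse):
    """Sum loss_fn over every non-empty subset of the stage predictions.

    stage_preds: list of flattened prediction maps, one per decoder stage.
    target:      flattened ground-truth map of the same length.
    Returns (total_loss, number_of_loss_terms).
    """
    total = 0.0
    n_terms = 0
    for k in range(1, len(stage_preds) + 1):
        for subset in combinations(stage_preds, k):
            total += loss_fn(mix(subset), target)
            n_terms += 1
    return total, n_terms


# Toy example: two stages, each correct on a different pixel.
stage_preds = [[1.0, 0.0], [0.0, 1.0]]
target = [1.0, 1.0]
total, n_terms = aggregate_mixed_losses(stage_preds, target)
print(total, n_terms)
```

With n stages this yields 2^n - 1 loss terms, so each stage is supervised both on its own and in combination with the others, which is one way to read "implicit ensembling". In a real training loop the lists would be tensors and `loss_fn` would be a segmentation loss.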