While CNN-based methods have been the cornerstone of medical image segmentation owing to their strong performance and robustness, they are limited in capturing long-range dependencies. Transformer-based approaches currently prevail because they enlarge the receptive field to model global contextual correlations. To extract richer representations, some extensions of the U-Net employ multi-scale feature extraction and fusion modules and obtain improved performance. Inspired by this idea, we propose TransCeption for medical image segmentation, a pure transformer-based U-shaped network that incorporates an inception-like module into the encoder and adopts a contextual bridge for better feature fusion. The design rests on three core principles: (1) The patch merging module in the encoder is redesigned as ResInception Patch Merging (RIPM), and the Multi-branch transformer (MB transformer) uses as many branches as RIPM has outputs. Combining the two modules enables the model to capture a multi-scale representation within a single stage. (2) We construct an Intra-stage Feature Fusion (IFF) module following the MB transformer to enhance the aggregation of feature maps from all branches, focusing in particular on the interaction between channels across all scales. (3) In contrast to a bridge that contains only token-wise self-attention, we propose a Dual Transformer Bridge that also includes channel-wise self-attention to exploit correlations between scales at different stages from a dual perspective. Extensive experiments on multi-organ and skin lesion segmentation tasks demonstrate the superior performance of TransCeption compared to previous work. The code is publicly available at \url{https://github.com/mindflow-institue/TransCeption}.
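The distinction between token-wise and channel-wise self-attention that motivates the Dual Transformer Bridge can be illustrated with a minimal NumPy sketch. This is not the TransCeption implementation: the function names `token_attention` and `channel_attention` are hypothetical, the learned query/key/value projections and multi-head structure are omitted, and the scaling factors are assumptions made only to keep the example self-contained. The sketch is meant to show that the two variants differ in the axis over which the attention map is computed.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_attention(x):
    """Token-wise self-attention on x of shape (N tokens, C channels).

    Simplification (assumption): queries, keys and values are all x itself.
    The attention map has shape (N, N): each token attends to every token.
    """
    n, c = x.shape
    weights = softmax(x @ x.T / np.sqrt(c), axis=-1)  # (N, N)
    return weights @ x, weights                        # output: (N, C)

def channel_attention(x):
    """Channel-wise self-attention on x of shape (N tokens, C channels).

    Channels play the role of tokens, so the attention map has shape (C, C):
    each channel attends to every channel.
    """
    n, c = x.shape
    weights = softmax(x.T @ x / np.sqrt(n), axis=-1)  # (C, C)
    return x @ weights.T, weights                      # output: (N, C)

# Toy input: 16 tokens with 8 channels each.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))

tok_out, tok_w = token_attention(x)
ch_out, ch_w = channel_attention(x)

print(tok_w.shape, ch_w.shape)      # attention maps: (16, 16) and (8, 8)
print(tok_out.shape, ch_out.shape)  # both outputs keep the input shape (16, 8)
```

In this simplified form the token-wise map grows quadratically with the number of tokens, while the channel-wise map grows quadratically with the number of channels, so the two views capture complementary correlations at different costs.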