Although the transformer is the preferred architecture in natural language processing, few studies have applied it to medical imaging. Because it can model long-range dependencies, the transformer is expected to help conventional convolutional neural networks overcome their inherent spatial inductive bias. Recently proposed transformer-based segmentation methods use the transformer only as an auxiliary module that encodes global context into a convolutional representation, and there has been hardly any study of how to optimally combine self-attention (the core of the transformer) with convolution. To address this problem, this article proposes MS-Twins (Multi-Scale Twins), a powerful segmentation model built on the combination of self-attention and convolution. By combining different scales and cascading features, MS-Twins better captures both semantic and fine-grained information. On two commonly used datasets, Synapse and ACDC, MS-Twins improves significantly on previous transformer-based methods. In particular, its performance on Synapse is 8% higher than that of SwinUNet. Even compared with nnUNet, the best fully convolutional medical image segmentation network, MS-Twins retains a slight advantage on both Synapse and ACDC.