The hybrid architecture that combines convolutional neural networks (CNNs) with Transformers has become the most popular approach to medical image segmentation. However, existing hybrid networks suffer from two problems. First, although the CNN branch captures local image features through convolution, vanilla convolution cannot extract image features adaptively. Second, although the Transformer branch models the global information of an image, conventional self-attention attends only along the spatial dimension and ignores channel and cross-dimensional self-attention, which leads to low segmentation accuracy on medical images with complex backgrounds. To solve these problems, we propose vision Transformer embrace convolutional neural networks for medical image segmentation (TEC-Net). Our network has two advantages. First, we design dynamic deformable convolution (DDConv) for the CNN branch; it overcomes the difficulty of extracting features adaptively with fixed-size convolution kernels and removes the limitation that different inputs share the same kernel parameters, which effectively improves the feature representation ability of the CNN branch. Second, in the Transformer branch, we design a (shifted)-window adaptive complementary attention module ((S)W-ACAM) and a compact convolutional projection, which enable the network to fully learn the cross-dimensional long-range dependencies of medical images with few parameters and little computation. Experimental results show that the proposed TEC-Net produces better medical image segmentation results than state-of-the-art (SOTA) methods, including both CNN and Transformer networks. In addition, TEC-Net requires fewer parameters and lower computational cost, and it does not rely on pre-training. The code is publicly available at https://github.com/SR0920/TEC-Net.
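To make the distinction between spatial and channel self-attention in the abstract concrete, the following is a minimal Python sketch under simplifying assumptions. It is not the (S)W-ACAM module and is not taken from the TEC-Net repository: the function names are hypothetical, the queries, keys, and values are all the raw token matrix (no learned projections), and there are no windows or shifts. It only shows that spatial attention builds an N x N map over tokens, while channel attention builds a C x C map over feature channels.

```python
import math


def transpose(m):
    """Transpose a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def spatial_attention(x):
    """Tokens attend to tokens. x is N x C; the attention map is N x N."""
    scale = math.sqrt(len(x[0]))              # sqrt(C)
    scores = matmul(x, transpose(x))          # N x N token similarities
    weights = softmax_rows([[s / scale for s in row] for row in scores])
    return matmul(weights, x), weights        # output is N x C


def channel_attention(x):
    """Channels attend to channels. x is N x C; the attention map is C x C."""
    scale = math.sqrt(len(x))                 # sqrt(N)
    xt = transpose(x)                         # C x N
    scores = matmul(xt, x)                    # C x C channel similarities
    weights = softmax_rows([[s / scale for s in row] for row in scores])
    return transpose(matmul(weights, xt)), weights  # output is N x C


# Toy input (illustrative values only): N = 4 tokens, C = 2 channels.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
spatial_out, spatial_map = spatial_attention(x)
channel_out, channel_map = channel_attention(x)
print(len(spatial_map), len(spatial_map[0]))  # map over tokens: 4 4
print(len(channel_map), len(channel_map[0]))  # map over channels: 2 2
```

Both functions return an output with the same shape as the input, so in principle the two results could be combined; how TEC-Net actually fuses spatial, channel, and cross-dimensional attention is described in the paper and is not reproduced here.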