Convolutional neural networks (CNNs) have made significant strides in medical image analysis in recent years. However, the local nature of the convolution operator prevents CNNs from capturing global and long-range interactions. Recently, Transformers have gained popularity in the computer vision community, including in medical image segmentation, but the scalability issues of the self-attention mechanism and the lack of a CNN-like inductive bias have limited their adoption. In this work, we present MaxViT-UNet, an encoder-decoder-based hybrid vision transformer for medical image segmentation. The proposed hybrid decoder, also based on the MaxViT block, is designed to harness the power of both convolution and self-attention at each decoding stage with minimal computational burden. The multi-axis self-attention in each decoder stage helps differentiate between object and background regions much more efficiently. The hybrid decoder block first fuses the lower-level features, upsampled via transpose convolution, with the skip-connection features coming from the hybrid encoder; the fused features are then refined using the multi-axis attention mechanism. The proposed decoder block is repeated multiple times to accurately segment the nuclei regions. Experimental results on the MoNuSeg dataset demonstrate the effectiveness of the proposed technique. Our MaxViT-UNet outperformed the previous CNN-only (UNet) and Transformer-only (Swin-UNet) techniques by large margins of 2.36% and 5.31% on the Dice metric, respectively.
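
The three steps of one decoder stage described in the abstract (upsample, fuse, refine) can be traced at the level of tensor shapes. The sketch below is illustrative only and is not the authors' implementation. The names `Shape`, `transpose_conv_size`, and `decoder_stage` are hypothetical, as are the kernel size and stride of 2, the use of channel concatenation for fusion, and the channel counts in the example. The only general fact it relies on is the standard output-size formula for a transpose convolution without dilation or output padding.

```python
from typing import NamedTuple


class Shape(NamedTuple):
    """Feature-map shape as (channels, height, width); hypothetical helper type."""
    channels: int
    height: int
    width: int


def transpose_conv_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length of a transpose convolution along one spatial axis
    (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def decoder_stage(low: Shape, skip: Shape, out_channels: int) -> Shape:
    """Trace the shapes through one decoder stage: upsample, fuse, refine."""
    # Step 1: upsample the lower-level features with a transpose convolution.
    # Assumption: kernel=2 and stride=2, which doubles the spatial resolution.
    up = Shape(
        out_channels,
        transpose_conv_size(low.height, kernel=2, stride=2),
        transpose_conv_size(low.width, kernel=2, stride=2),
    )

    # Step 2: fuse with the encoder's skip-connection features.
    # Assumption: fusion is channel-wise concatenation, so sizes must match.
    if (up.height, up.width) != (skip.height, skip.width):
        raise ValueError("upsampled and skip features must share spatial size")
    fused = Shape(up.channels + skip.channels, up.height, up.width)

    # Step 3: refine the fused features with a MaxViT-style block.
    # Assumption: the block keeps the spatial size and maps to out_channels.
    return Shape(out_channels, fused.height, fused.width)


# Example with made-up channel counts: an 8x8 map is upsampled to 16x16.
low_level = Shape(channels=512, height=8, width=8)
skip_features = Shape(channels=256, height=16, width=16)
print(decoder_stage(low_level, skip_features, out_channels=256))
```

Repeating `decoder_stage` once per encoder level, each time passing in the previous stage's output as `low`, mirrors the statement that the decoder block is repeated multiple times.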
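
The abstract reports its gains on the Dice metric without defining it. For reference, the sketch below computes the standard Dice coefficient, 2|P ∩ T| / (|P| + |T|), for a pair of binary masks. It is a minimal illustration and not the paper's evaluation code: the function name `dice_coefficient`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made here.

```python
from typing import Sequence


def dice_coefficient(
    pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]
) -> float:
    """Dice coefficient for two binary masks (1 = nucleus, 0 = background)."""
    intersection = 0
    pred_sum = 0
    target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += p * t  # counts pixels set in both masks
            pred_sum += p
            target_sum += t
    if pred_sum + target_sum == 0:
        # Both masks are empty; treating this as perfect agreement is an
        # assumed convention, not something the abstract specifies.
        return 1.0
    return 2.0 * intersection / (pred_sum + target_sum)


# Toy 3x3 masks: 3 predicted pixels, 3 ground-truth pixels, 2 in common.
prediction = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
ground_truth = [
    [1, 1, 0],
    [0, 0, 0],
    [0, 1, 0],
]
score = dice_coefficient(prediction, ground_truth)
print(f"Dice = {score:.4f}")
```

A score of 1.0 means the predicted and reference masks overlap exactly, and 0.0 means they share no foreground pixels.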