MaxViT-UNet: Multi-Axis Attention for Medical Image Segmentation

Convolutional Neural Networks (CNNs) have made significant strides in medical image analysis in recent years. However, the local nature of the convolution operator can limit their ability to capture global and long-range interactions. Recently, Transformers have gained popularity in the computer vision community, including in medical image segmentation, due to their ability to process global features effectively. However, the scalability issues of the self-attention mechanism and the lack of CNN-like inductive bias may have limited their adoption. Therefore, hybrid vision transformers (CNN-Transformer), which exploit the advantages of both convolution and self-attention, have gained importance. In this work, we present MaxViT-UNet, an encoder-decoder based hybrid vision transformer (CNN-Transformer) for medical image segmentation. The proposed Hybrid Decoder, based on the MaxViT block, is designed to harness the power of both convolution and self-attention at each decoding stage with nominal computational burden. The inclusion of multi-axis self-attention within each decoder stage significantly enhances the discriminating capacity between object and background regions, thereby improving segmentation efficiency. In the Hybrid Decoder block, the fusion process begins by integrating the upsampled lower-level decoder features, obtained through transpose convolution, with the skip-connection features from the hybrid encoder. The fused features are then refined using a multi-axis attention mechanism. The proposed decoder block is repeated multiple times to progressively segment the nuclei regions. Experimental results on the MoNuSeg18 and MoNuSAC20 datasets demonstrate the effectiveness of the proposed technique.
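The abstract does not spell out how multi-axis attention groups the pixels of a feature map. As a rough illustration, and assuming the decoder follows the original MaxViT design (local block attention over non-overlapping windows, followed by sparse grid attention over strided samples), the sketch below lists which pixel coordinates would attend to one another in each step. The function names, the 4x4 map, and the window and grid sizes are hypothetical choices made for this example; they are not taken from the paper.

```python
def block_partition(height, width, p):
    """Group pixel coordinates into non-overlapping p x p windows.

    Attention inside each group is local (assumed MaxViT-style block attention).
    """
    assert height % p == 0 and width % p == 0, "map must be divisible by window size"
    groups = {}
    for r in range(height):
        for c in range(width):
            # Pixels sharing the same window index belong to the same group.
            groups.setdefault((r // p, c // p), []).append((r, c))
    return list(groups.values())


def grid_partition(height, width, g):
    """Group pixel coordinates into sets of g x g evenly strided samples.

    Attention inside each group spans the whole map sparsely
    (assumed MaxViT-style grid attention).
    """
    assert height % g == 0 and width % g == 0, "map must be divisible by grid size"
    stride_r, stride_c = height // g, width // g
    groups = {}
    for r in range(height):
        for c in range(width):
            # Pixels sharing the same offset inside a stride cell belong together.
            groups.setdefault((r % stride_r, c % stride_c), []).append((r, c))
    return list(groups.values())


if __name__ == "__main__":
    # Hypothetical 4x4 feature map with 2x2 windows and a 2x2 grid.
    print(block_partition(4, 4, 2)[0])  # [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(grid_partition(4, 4, 2)[0])   # [(0, 0), (0, 2), (2, 0), (2, 2)]
```

Under this assumption, the first group of the block partition covers one contiguous corner of the map, while the first group of the grid partition samples pixels spread across the entire map. Applying attention within both kinds of groups in sequence is what gives each decoder stage local and global context at a cost that stays linear in the number of pixels for fixed window and grid sizes.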
