In this work, we present MaxViT-UNet, an encoder-decoder hybrid vision transformer (CNN-Transformer) for medical image segmentation. The proposed Hybrid Decoder, built on the MaxViT block, is designed to harness both convolution and self-attention at each decoding stage with a nominal memory and computational burden. Including multi-axis self-attention within each decoder stage significantly enhances the ability to discriminate between object and background regions, thereby helping to improve segmentation efficiency. In the Hybrid Decoder block, fusion begins by integrating the upsampled lower-level decoder features, obtained through transpose convolution, with the skip-connection features from the hybrid encoder. The fused features are then refined by a multi-axis attention mechanism. The proposed decoder block is repeated multiple times to progressively segment the nuclei regions. Experimental results on the MoNuSeg18 and MoNuSAC20 datasets demonstrate the effectiveness of the proposed technique. Our MaxViT-UNet outperformed previous CNN-based (UNet) and Transformer-based (Swin-UNet) techniques by a considerable margin on both standard datasets. The implementation and trained weights are available on GitHub (https://github.com/PRLAB21/MaxViT-UNet).
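To make the decoder description above concrete, the following is a minimal NumPy sketch of one decoding step: upsample the lower-level features, fuse them with the skip-connection features, then refine the result with block (local window) and grid (strided, global) attention. It is an illustration under stated assumptions, not the released implementation. Nearest-neighbour upsampling stands in for the learned transpose convolution, fusion is assumed to be channel concatenation, the attention uses identity query/key/value projections with no learned weights, and the function names, window size `p`, and grid size `g` are all hypothetical.

```python
import numpy as np


def block_partition(x, p):
    """Split (H, W, C) into non-overlapping p x p windows: (num_windows, p*p, C)."""
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p, c)


def block_merge(windows, h, w, p):
    """Inverse of block_partition."""
    c = windows.shape[-1]
    x = windows.reshape(h // p, w // p, p, p, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h, w, c)


def grid_partition(x, g):
    """Group tokens on a strided g x g grid: (num_groups, g*g, C)."""
    h, w, c = x.shape
    x = x.reshape(g, h // g, g, w // g, c)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, g * g, c)


def grid_merge(groups, h, w, g):
    """Inverse of grid_partition."""
    c = groups.shape[-1]
    x = groups.reshape(h // g, w // g, g, g, c)
    return x.transpose(2, 0, 3, 1, 4).reshape(h, w, c)


def self_attention(tokens):
    """Scaled dot-product self-attention within each group.

    Assumption: identity Q/K/V projections (no learned weights).
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens


def fuse(decoder_feat, skip_feat):
    """Upsample decoder features 2x and concatenate with the skip features.

    Assumption: nearest-neighbour upsampling replaces the transpose
    convolution, and fusion is channel-wise concatenation.
    """
    up = decoder_feat.repeat(2, axis=0).repeat(2, axis=1)
    return np.concatenate([up, skip_feat], axis=-1)


def multi_axis_attention(x, p=4, g=4):
    """Local block attention followed by global grid attention, with residuals."""
    h, w, _ = x.shape
    x = x + block_merge(self_attention(block_partition(x, p)), h, w, p)
    x = x + grid_merge(self_attention(grid_partition(x, g)), h, w, g)
    return x


rng = np.random.default_rng(0)
low = rng.standard_normal((8, 8, 16))     # lower-level decoder features
skip = rng.standard_normal((16, 16, 16))  # encoder skip-connection features
out = multi_axis_attention(fuse(low, skip), p=4, g=4)
print(out.shape)  # (16, 16, 32)
```

Block attention mixes information only within each `p x p` window, while grid attention lets tokens spaced `H/g` apart attend to one another, so the two steps together cover local and global context at a cost that depends on the window and grid sizes rather than on full pairwise attention over all pixels.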