HST-MRF: Heterogeneous Swin Transformer with Multi-Receptive Field for Medical Image Segmentation

The Transformer has been applied successfully to medical image segmentation because of its strong long-range modeling capability. However, Transformer-based models require patch segmentation, that is, partitioning the input image into patches. This step can disrupt tissue structures in medical images and cause relevant information to be lost. In this study, we propose a Heterogeneous Swin Transformer with Multi-Receptive Field (HST-MRF) model, built on a U-shaped network, for medical image segmentation. Its main purpose is to counter the loss of structural information caused by patch segmentation in Transformers by fusing patch information obtained under different receptive fields. The core module, the heterogeneous Swin Transformer (HST), lets multi-receptive field patch information interact through heterogeneous attention and passes the result to the next stage for progressive learning. We also design a two-stage fusion module, multimodal bilinear pooling (MBP), which assists HST in further fusing multi-receptive field information and combines low-level and high-level semantic information to localize lesion regions accurately. In addition, we develop adaptive patch embedding (APE) and soft channel attention (SCA) modules, which retain more valuable information when acquiring patch embeddings and when filtering channel features, respectively, thereby improving segmentation quality. We evaluate HST-MRF on multiple datasets for polyp and skin lesion segmentation. Experimental results show that the proposed method outperforms state-of-the-art models. Ablation experiments further verify the effectiveness of each module and the benefit of multi-receptive field segmentation in reducing the loss of structural information.
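To make the patch-segmentation issue concrete, the following is a minimal NumPy sketch of non-overlapping patch partitioning at several patch sizes. It is an illustration only, not the HST-MRF implementation: the function names `partition_patches` and `patches_touched`, the 32x32 image size, the patch sizes 4, 8, and 16, and the synthetic 4x4 "lesion" mask are all assumptions chosen for the example.

```python
import numpy as np


def partition_patches(image, patch_size):
    """Split an (H, W, C) array into non-overlapping patches, one flat token per patch."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, C) -> (gh, gw, p, p, C) -> (gh * gw, p * p * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def patches_touched(mask, patch_size):
    """Count the patches that contain at least one foreground pixel of an (H, W) mask."""
    tokens = partition_patches(mask[..., None], patch_size)
    return int((tokens.sum(axis=1) > 0).sum())


# Hypothetical toy input: a 4x4 foreground region that straddles the grid
# lines of the smaller patch sizes.
mask = np.zeros((32, 32), dtype=np.float32)
mask[6:10, 6:10] = 1.0

for p in (4, 8, 16):
    tokens = partition_patches(np.zeros((32, 32, 3), dtype=np.float32), p)
    print(p, tokens.shape, patches_touched(mask, p))
```

In this toy case the small region is split across four patches at patch sizes 4 and 8 but falls inside a single patch at size 16, which is the kind of structural fragmentation that combining several receptive fields is meant to mitigate. How HST-MRF actually embeds and fuses such patches (APE, heterogeneous attention, MBP) is not specified at this level of detail in the abstract and is not modeled here.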
