BEFUnet: A Hybrid CNN-Transformer Architecture for Precise Medical Image Segmentation

Accurate segmentation of medical images is critical for many healthcare applications. Convolutional neural networks (CNNs), especially fully convolutional networks (FCNs) such as U-Net, have shown remarkable success in medical image segmentation. However, they are limited in capturing global context and long-range dependencies, especially for objects that vary significantly in shape, scale, and texture. Transformers have achieved state-of-the-art results in natural language processing and image recognition, but they face challenges in medical image segmentation because they lack the inductive biases of locality and translation invariance that convolutions provide. To address these challenges, this paper proposes an innovative U-shaped network called BEFUnet, which enhances the fusion of body and edge information for precise medical image segmentation. BEFUnet comprises three main modules: a dual-branch encoder, a novel Local Cross-Attention Feature (LCAF) fusion module, and a novel Double-Level Fusion (DLF) module. The dual-branch encoder consists of an edge encoder and a body encoder. The edge encoder employs pixel difference convolution (PDC) blocks for effective edge information extraction, while the body encoder uses the Swin Transformer to capture semantic information with global attention. The LCAF module efficiently fuses edge and body features by selectively performing local cross-attention only between features of the two branches that are spatially close. This local approach significantly reduces computational complexity compared to global cross-attention while ensuring accurate feature matching. BEFUnet demonstrates superior performance over existing methods across various evaluation metrics on medical image segmentation datasets.
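To give a rough sense of why restricting cross-attention to spatially close features lowers the cost, the sketch below counts query-key pairs for global versus local cross-attention on a feature map. It is an illustration under stated assumptions, not the LCAF module itself: the function name `attention_pair_counts`, the non-overlapping square window, and the 56 x 56 map with a window of 7 are hypothetical choices that do not come from the abstract.

```python
def attention_pair_counts(height, width, window):
    """Count query-key pairs for global vs. windowed cross-attention.

    Assumption (not taken from the paper): "local" means each body-feature
    query attends only to the edge-feature keys inside its own
    non-overlapping window x window patch. The neighborhood actually used
    by LCAF may be defined differently.
    """
    if height % window or width % window:
        raise ValueError("window must divide both height and width")
    tokens = height * width
    global_pairs = tokens * tokens          # every query attends to every key
    local_pairs = tokens * window * window  # every query attends to one window
    return global_pairs, local_pairs


# Hypothetical sizes chosen only for illustration.
global_pairs, local_pairs = attention_pair_counts(56, 56, 7)
print(global_pairs, local_pairs, global_pairs // local_pairs)
```

Under these assumptions the global cost grows quadratically with the number of spatial positions, while the local cost grows linearly for a fixed window size, so the gap widens as the feature-map resolution increases.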
