HiFiSeg: High-Frequency Information Enhanced Polyp Segmentation with Global-Local Vision Transformer

Vision Transformer (ViT)-based methods have shown strong performance across a wide range of computer vision tasks. However, ViT models often struggle to capture the high-frequency components of an image, which are crucial for detecting small targets and preserving edge details, especially in complex scenes. This limitation is particularly problematic in colon polyp segmentation, where polyps vary considerably in structure, texture, and shape, and where high-frequency information such as boundary detail is essential for precise segmentation. To address these challenges, we propose HiFiSeg, a novel network for colon polyp segmentation that enhances high-frequency information processing through a global-local vision transformer framework. HiFiSeg uses the pyramid vision transformer (PVT) as its encoder and introduces two key modules: the global-local interaction module (GLIM) and the selective aggregation module (SAM). GLIM employs a parallel structure to fuse global and local information at multiple scales, effectively capturing fine-grained features. SAM selectively integrates boundary details from low-level features with semantic information from high-level features, improving the model's ability to detect and segment polyps accurately. Extensive experiments on five widely used benchmark datasets demonstrate the effectiveness of HiFiSeg for polyp segmentation. Notably, HiFiSeg reaches mDice scores of 0.826 and 0.822 on the challenging CVC-ColonDB and ETIS datasets, respectively.
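For readers unfamiliar with the reported metric, the sketch below shows one common way to compute a mean Dice score over a set of binary masks. It is an illustration only: the abstract does not describe the evaluation protocol, so the per-image averaging, the empty-mask convention, and the names `dice` and `mean_dice` are assumptions, not the authors' implementation.

```python
from typing import Sequence

# A binary mask as rows of 0/1 values (assumed representation for this sketch).
Mask = Sequence[Sequence[int]]


def dice(pred: Mask, target: Mask) -> float:
    """Dice coefficient 2|P ∩ G| / (|P| + |G|) for two binary masks."""
    # Overlap: pixels set to 1 in both the prediction and the ground truth.
    intersection = sum(
        p * g
        for pred_row, target_row in zip(pred, target)
        for p, g in zip(pred_row, target_row)
    )
    total = sum(map(sum, pred)) + sum(map(sum, target))
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


def mean_dice(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average the per-image Dice scores (assumed definition of mDice)."""
    scores = [dice(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[0, 1], [1, 1]]]
    targets = [[[1, 0], [0, 0]], [[0, 1], [1, 1]]]
    print(round(mean_dice(preds, targets), 4))
```

In this toy case the first prediction overlaps its target on one of three foreground pixels counted across both masks (Dice 2/3), and the second matches exactly (Dice 1), giving a mean of 5/6.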