Most polyp segmentation methods use CNNs as their backbone, which raises two key issues when exchanging information between the encoder and decoder: 1) accounting for the differing contributions of features at different levels, and 2) designing an effective mechanism for fusing these features. Unlike existing CNN-based methods, we adopt a transformer encoder, which learns more powerful and robust representations. In addition, considering the influence of image acquisition and the elusive properties of polyps, we introduce three standard modules: a cascaded fusion module (CFM), a camouflage identification module (CIM), and a similarity aggregation module (SAM). The CFM collects the semantic and location information of polyps from high-level features; the CIM captures polyp information disguised in low-level features; and the SAM uses high-level semantic position information to extend the pixel features of the polyp area to the entire polyp region, thereby effectively fusing cross-level features. The proposed model, named Polyp-PVT, effectively suppresses noise in the features and significantly improves their expressive capabilities. Extensive experiments on five widely adopted datasets show that the proposed model is more robust to various challenging situations (e.g., appearance changes, small objects, rotation) than existing representative methods. The proposed model is available at https://github.com/DengPingFan/Polyp-PVT.
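The data flow described above can be sketched at the shape level. This is only an illustration of which features each module consumes, not the released implementation. The `Feature` class, the function names, the four-stage channel/stride values, the fused width of 32, and the output resolutions are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Shape-only stand-in for a feature map (hypothetical helper)."""
    channels: int
    stride: int  # downsampling factor relative to the input image


def encoder():
    # Assumption: a four-stage pyramid transformer encoder; these
    # channel/stride values are illustrative, not taken from the abstract.
    return [Feature(64, 4), Feature(128, 8), Feature(320, 16), Feature(512, 32)]


def cim(low_level):
    # Camouflage identification: refines the low-level map.
    # Assumption: the refinement keeps the input shape.
    return Feature(low_level.channels, low_level.stride)


def cfm(high_level, width=32):
    # Cascaded fusion: merges the high-level maps into one map.
    # Assumption: output sits at the finest resolution among its inputs.
    return Feature(width, min(f.stride for f in high_level))


def sam(semantic, detail):
    # Similarity aggregation: fuses the CFM output (semantic) with the
    # CIM output (detail). Assumption: output keeps the semantic shape.
    return Feature(semantic.channels, semantic.stride)


def polyp_pvt_flow():
    x1, x2, x3, x4 = encoder()
    low = cim(x1)             # low-level branch
    high = cfm([x2, x3, x4])  # high-level branch
    fused = sam(high, low)    # cross-level fusion
    return low, high, fused
```

Calling `polyp_pvt_flow()` returns the three intermediate shapes, which makes the split explicit: only the first encoder stage feeds the CIM, the remaining three feed the CFM, and the SAM is the single point where the two branches meet.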