Most polyp segmentation methods use CNNs as their backbone, which raises two key issues when exchanging information between the encoder and decoder: 1) accounting for the differences in contribution between features at different levels, and 2) designing an effective mechanism for fusing these features. Unlike existing CNN-based methods, we adopt a transformer encoder, which learns more powerful and robust representations. In addition, considering the influence of image acquisition and the elusive properties of polyps, we introduce three standard modules: a cascaded fusion module (CFM), a camouflage identification module (CIM), and a similarity aggregation module (SAM). The CFM collects the semantic and location information of polyps from high-level features; the CIM captures polyp information disguised in low-level features; and the SAM extends the pixel features of the polyp area with high-level semantic position information to the entire polyp area, thereby effectively fusing cross-level features. The proposed model, named Polyp-PVT, effectively suppresses noise in the features and significantly improves their expressive capabilities. Extensive experiments on five widely adopted datasets show that the proposed model is more robust to various challenging situations (e.g., appearance changes, small objects, rotation) than existing representative methods. The proposed model is available at https://github.com/DengPingFan/Polyp-PVT.
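As a rough illustration of the intuition behind the SAM, the sketch below spreads a coarse high-level polyp cue to every pixel whose low-level features look similar. It is not the implementation used in Polyp-PVT, whose details are not given in this abstract: the cosine-similarity prototype is a deliberately simple stand-in, and the names `propagate_by_similarity`, `pixel_feats`, and `seed_weights` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def propagate_by_similarity(pixel_feats, seed_weights):
    """Score every pixel by its similarity to a prototype of the seeded pixels.

    pixel_feats:  per-pixel feature vectors (assumed low-level features).
    seed_weights: per-pixel confidence in [0, 1] taken from a coarse
                  high-level map (assumed input, not the paper's format).
    """
    total = sum(seed_weights)
    if total == 0:
        # No seeded pixel: nothing to propagate.
        return [0.0] * len(pixel_feats)
    dim = len(pixel_feats[0])
    # The confidence-weighted mean of the seeded pixels acts as a polyp prototype.
    prototype = [
        sum(w * f[d] for f, w in zip(pixel_feats, seed_weights)) / total
        for d in range(dim)
    ]
    return [cosine(f, prototype) for f in pixel_feats]


# Toy example: only the first pixel is marked by the coarse high-level cue.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
seeds = [1.0, 0.0, 0.0, 0.0]
scores = propagate_by_similarity(feats, seeds)
```

In this toy input, the second and fourth pixels receive high scores because their features resemble the seeded pixel, while the third, whose features are orthogonal, does not. This mirrors the idea of extending a localized cue to the entire polyp area.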