Automatic polyp segmentation is crucial for improving the clinical identification of colorectal cancer (CRC). While Deep Learning (DL) techniques have been extensively researched for this problem, current methods frequently struggle with generalization, particularly in data-constrained or challenging settings. Moreover, many existing polyp segmentation methods rely on complex, task-specific architectures. To address these limitations, we present a framework that exploits the intrinsic robustness of DINO self-attention "key" features for segmentation. Unlike traditional methods that extract tokens from the deepest layers of the Vision Transformer (ViT), our approach uses the key features of the self-attention module with a simple convolutional decoder to predict polyp masks, resulting in enhanced performance and better generalizability. We validate our approach on a multi-center dataset under two rigorous protocols: Domain Generalization (DG) and Extreme Single Domain Generalization (ESDG). Our results, supported by a comprehensive statistical analysis, demonstrate that this pipeline achieves state-of-the-art (SOTA) performance, significantly enhancing generalization, particularly in data-scarce and challenging scenarios. While avoiding a polyp-specific architecture, we surpass well-established models such as nnU-Net and UM-Net. Additionally, we provide a systematic benchmark of the DINO framework's evolution, quantifying the specific impact of architectural advancements on downstream polyp segmentation performance.
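To make the pipeline concrete, the sketch below illustrates the general idea of decoding ViT self-attention key tokens into a per-patch mask. All shapes and names here are illustrative assumptions (a ViT-S/16-style backbone on 224×224 input, key dimension 384, and a 1×1-convolution-style decoder modeled as a per-patch linear map); the paper's actual decoder and feature extraction may differ.

```python
import numpy as np

# Assumed shapes: a ViT-S/16-style model on a 224x224 image yields
# 14x14 = 196 patch tokens plus one [CLS] token; the self-attention
# "key" projection has dim d = 384. Random values stand in for real
# key features extracted from a DINO backbone.
n_patches, d = 196, 384
rng = np.random.default_rng(0)
keys = rng.normal(size=(1 + n_patches, d))  # [CLS] token + patch tokens

# Drop the [CLS] token and reshape the patch keys into a 2-D feature map.
h = w = int(np.sqrt(n_patches))             # 14
fmap = keys[1:].reshape(h, w, d)            # (14, 14, 384)

# Minimal "convolutional decoder": a 1x1 convolution is a per-position
# linear map, so we model it as a matrix multiply from d channels down
# to a single mask-logit channel. (A real decoder would stack several
# conv layers and upsample to full image resolution.)
w_dec = rng.normal(size=(d, 1)) / np.sqrt(d)
logits = fmap @ w_dec                       # (14, 14, 1)
mask = 1.0 / (1.0 + np.exp(-logits[..., 0]))  # sigmoid -> per-patch probability
print(mask.shape)
```

In practice the keys would come from a forward hook on the attention block of a pretrained DINO ViT, and the low-resolution mask would be upsampled to the input size before computing the segmentation loss.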