The recently introduced Segment-Anything Model (SAM) has the potential to greatly accelerate the development of segmentation models. However, directly applying SAM to surgical images has key limitations, including (1) the requirement of image-specific prompts at test-time, which prevents fully automated segmentation, and (2) ineffectiveness due to the substantial domain gap between natural and surgical images. In this work, we propose CycleSAM, an approach for one-shot surgical scene segmentation that uses the training image-mask pair at test-time to automatically identify points in the test image that correspond to each object class; these points can then be used to prompt SAM to produce object masks. To produce high-fidelity matches, we introduce a novel spatial cycle-consistency constraint that requires point proposals in the test image to rematch to points within the object foreground region of the training image. Then, to address the domain gap, rather than directly using the visual features from SAM, we employ a ResNet50 encoder pretrained on surgical images in a self-supervised fashion, thereby maintaining high label-efficiency. We evaluate CycleSAM for one-shot segmentation on two diverse surgical semantic segmentation datasets, comprehensively outperforming baseline approaches and reaching up to 50% of fully-supervised performance.
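The core matching step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, feature shapes, and the synthetic example are all assumptions made for illustration. It shows the general idea of spatial cycle-consistency filtering with cosine similarity on dense features: foreground points from the training image are matched forward into the test image, each proposal is rematched back to the training image, and proposals whose back-match falls outside the foreground mask are discarded before the survivors are used as point prompts.

```python
import numpy as np

def cycle_consistent_points(feat_train, feat_test, fg_mask, top_k=5):
    """Hypothetical sketch of spatial cycle-consistency point matching.

    feat_train, feat_test: (C, H, W) dense feature maps for the one-shot
        training image and the test image (e.g. from a pretrained encoder).
    fg_mask: (H, W) boolean foreground mask for one class in the train image.
    Returns up to top_k (row, col) point prompts in the test image.
    """
    C, H, W = feat_train.shape
    # Flatten to (H*W, C) and L2-normalize so a dot product is cosine similarity.
    ft = feat_train.reshape(C, -1).T
    fq = feat_test.reshape(C, -1).T
    ft = ft / (np.linalg.norm(ft, axis=1, keepdims=True) + 1e-8)
    fq = fq / (np.linalg.norm(fq, axis=1, keepdims=True) + 1e-8)

    fg_idx = np.flatnonzero(fg_mask.ravel())
    sim_fwd = ft[fg_idx] @ fq.T           # train-foreground -> test similarities
    fwd_match = sim_fwd.argmax(axis=1)    # best test location per foreground point

    # Cycle check: rematch each proposed test point back to the training image;
    # the back-match must land inside the foreground region to be kept.
    sim_bwd = fq[fwd_match] @ ft.T
    back_match = sim_bwd.argmax(axis=1)
    consistent = fg_mask.ravel()[back_match]

    # Rank surviving proposals by forward similarity and keep the top_k.
    scores = sim_fwd[np.arange(len(fg_idx)), fwd_match]
    keep = np.flatnonzero(consistent)
    keep = keep[np.argsort(-scores[keep])][:top_k]
    pts = fwd_match[keep]
    return np.stack([pts // W, pts % W], axis=1)  # (row, col) prompts

# Tiny synthetic check: foreground features are [1, 0], background [0, 1];
# exactly one test location carries the foreground feature.
H = W = 4
feat_train = np.zeros((2, H, W)); feat_train[1] = 1.0
fg_mask = np.zeros((H, W), dtype=bool); fg_mask[1:3, 1:3] = True
feat_train[0][fg_mask] = 1.0; feat_train[1][fg_mask] = 0.0
feat_test = np.zeros((2, H, W)); feat_test[1] = 1.0
feat_test[0, 2, 3] = 1.0; feat_test[1, 2, 3] = 0.0

points = cycle_consistent_points(feat_train, feat_test, fg_mask, top_k=2)
```

In the synthetic example every foreground point matches the single foreground-like test location at (2, 3), and the back-match lands inside the mask, so all proposals survive the cycle check. In a real image pair, ambiguous matches (e.g. background regions that resemble the object) would fail the rematch and be filtered out before prompting SAM.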