SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation

The Segment Anything Model (SAM) is a powerful foundation model that has revolutionised image segmentation. To apply SAM to surgical instrument segmentation, a common approach is to locate precise points or boxes of instruments and then use them as prompts for SAM in a zero-shot manner. However, we observe two problems with this naive pipeline: (1) the domain gap between natural objects and surgical instruments leads to poor generalisation of SAM; and (2) SAM relies on precise point or box locations for accurate segmentation, requiring either extensive manual guidance or a well-performing specialist detector for prompt preparation, which leads to a complex multi-stage pipeline. To address these problems, we introduce SurgicalSAM, a novel end-to-end efficient-tuning approach for SAM that integrates surgical-specific information with SAM's pre-trained knowledge for improved generalisation. Specifically, we propose a lightweight prototype-based class prompt encoder for tuning, which generates prompt embeddings directly from class prototypes and eliminates the need for explicit prompts, improving robustness and simplifying the pipeline. In addition, to address the low inter-class variance among surgical instrument categories, we propose contrastive prototype learning, which further enhances the discrimination of the class prototypes for more accurate class prompting. Extensive experiments on the EndoVis2018 and EndoVis2017 datasets demonstrate that SurgicalSAM achieves state-of-the-art performance while requiring only a small number of tunable parameters. The source code will be released at https://github.com/wenxi-yue/SurgicalSAM.
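
To make the two ideas above more concrete, the following is a minimal NumPy sketch of (a) comparing image features against class prototypes and (b) a contrastive objective that pulls each class embedding towards its own prototype and away from the others. The abstract does not specify the exact formulation, so everything here is an assumption for illustration: the function names `class_prompt_similarity` and `prototype_contrastive_loss`, the use of cosine similarity, the InfoNCE-style loss, and the `temperature` value are all hypothetical and are not taken from the released code.

```python
import numpy as np


def class_prompt_similarity(image_embedding, prototypes):
    """Cosine similarity between every spatial feature and every class prototype.

    image_embedding: (H, W, D) feature map (assumed to come from an image encoder).
    prototypes:      (C, D) one prototype vector per instrument class (assumed learnable).
    Returns an array of shape (C, H, W).
    """
    feats = image_embedding / np.linalg.norm(image_embedding, axis=-1, keepdims=True)
    protos = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    return np.einsum("hwd,cd->chw", feats, protos)


def prototype_contrastive_loss(class_embeddings, labels, prototypes, temperature=0.07):
    """InfoNCE-style loss between class embeddings and prototypes (illustrative only).

    class_embeddings: (N, D) one pooled embedding per instrument instance.
    labels:           (N,) integer class index of each embedding.
    prototypes:       (C, D) class prototypes.
    """
    z = class_embeddings / np.linalg.norm(class_embeddings, axis=-1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    logits = z @ p.T / temperature                       # (N, C) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # subtract row max for stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-probability of the correct prototype, averaged over instances.
    return -log_prob[np.arange(len(labels)), labels].mean()
```

In this sketch, the similarity maps would play the role of the class-conditioned signal from which prompt embeddings are derived, and the loss becomes smaller as embeddings of the same class cluster around their own prototype, which is one plausible way to counter low inter-class variance.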
