Part to Whole: Collaborative Prompting for Surgical Instrument Segmentation

Foundation models such as the Segment Anything Model (SAM) have demonstrated promise in generic object segmentation. However, directly applying SAM to surgical instrument segmentation presents two key challenges. First, SAM relies on per-frame point or box prompts, which complicate surgeon-computer interaction. Second, SAM yields suboptimal performance when segmenting surgical instruments, owing to insufficient surgical data in its pre-training as well as the complex structure and fine-grained details of surgical instruments. To address these challenges, we investigate text-promptable surgical instrument segmentation and propose SP-SAM (SurgicalPart-SAM), a novel efficient-tuning approach that integrates surgical instrument structure knowledge with the generic segmentation knowledge of SAM. Specifically, we propose (1) collaborative prompts of the text form "[part name] of [instrument category name]" that decompose instruments into fine-grained parts; (2) a Cross-Modal Prompt Encoder that encodes text prompts jointly with visual embeddings into discriminative part-level representations; and (3) a Part-to-Whole Selective Fusion and a Hierarchical Decoding strategy that selectively assemble the part-level representations into a whole for accurate instrument segmentation. Built upon these components, SP-SAM better comprehends surgical instrument structures and distinguishes between instrument categories. Extensive experiments on both the EndoVis2018 and EndoVis2017 datasets demonstrate SP-SAM's state-of-the-art performance with minimal tunable parameters. Code is available at https://github.com/wenxi-yue/SurgicalPart-SAM.
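As a rough illustration of the collaborative prompt format and the part-to-whole idea, the following minimal sketch builds "[part name] of [instrument category name]" prompts and combines per-part feature vectors with a weighted average. The part vocabulary, the function names `collaborative_prompts` and `fuse_parts`, and the weighted-average fusion are all assumptions made for illustration; they are not taken from the SP-SAM code and do not reproduce its Cross-Modal Prompt Encoder or Selective Fusion modules.

```python
from typing import Dict, List, Sequence

# Hypothetical part vocabulary per instrument category (illustrative only;
# the abstract does not enumerate the parts).
PARTS_BY_CATEGORY: Dict[str, List[str]] = {
    "bipolar forceps": ["shaft", "wrist", "clasper"],
    "ultrasound probe": ["shaft"],
}


def collaborative_prompts(category: str) -> List[str]:
    """Build one text prompt per part, in the form '[part] of [category]'."""
    return [f"{part} of {category}" for part in PARTS_BY_CATEGORY[category]]


def fuse_parts(
    part_features: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Toy part-to-whole fusion: weighted average of part-level features.

    A weight of 0 drops a part entirely (e.g. a part not visible in the
    frame). This stands in for a learned selective fusion; it is an
    assumption, not the paper's module.
    """
    if len(part_features) != len(weights):
        raise ValueError("need exactly one weight per part")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one part must have a positive weight")
    dim = len(part_features[0])
    return [
        sum(w * feats[i] for w, feats in zip(weights, part_features)) / total
        for i in range(dim)
    ]


if __name__ == "__main__":
    for prompt in collaborative_prompts("bipolar forceps"):
        print(prompt)
    # Two parts with equal weight contribute equally to the whole.
    print(fuse_parts([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0]))
```

In a real pipeline the per-part weights would presumably be predicted by the model rather than supplied by hand, and the fused representation would feed a mask decoder; both steps are omitted here.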
