The Segment Anything Model (SAM) shows promise in generic object segmentation and offers potential for various applications. Existing methods have applied SAM to surgical instrument segmentation (SIS) by tuning SAM-based frameworks with surgical data. However, they fall short in two crucial aspects: (1) straightforward model tuning with instrument masks treats each instrument as a single entity, neglecting its complex structure and fine-grained details; and (2) instrument category-based prompts are not flexible or informative enough to describe instrument structures. To address these problems, we investigate text-promptable SIS and propose SurgicalPart-SAM (SP-SAM), a novel efficient-tuning approach for SAM that explicitly integrates instrument structure knowledge with SAM's generic knowledge, guided by expert knowledge of instrument part compositions. Specifically, we propose (1) Collaborative Prompts, which describe instrument structures by combining category-level and part-level texts; (2) a Cross-Modal Prompt Encoder, which encodes text prompts jointly with visual embeddings into discriminative part-level representations; and (3) Part-to-Whole Adaptive Fusion and Hierarchical Decoding, which adaptively fuse the part-level representations into a whole for accurate instrument segmentation in surgical scenarios. Built on these components, SP-SAM better comprehends surgical instruments in terms of both overall structure and part-level details. Extensive experiments on both the EndoVis2018 and EndoVis2017 datasets demonstrate SP-SAM's state-of-the-art performance with minimal tunable parameters. The code will be available at https://github.com/wenxi-yue/SurgicalPart-SAM.
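To make the part-to-whole idea concrete, the following is a minimal sketch rather than the SP-SAM implementation. It builds one part-level text per instrument part from a category name, then fuses per-part feature vectors into a single whole-instrument vector with adaptive weights. The part names, the prompt wording, the function names, and the choice of softmax over scalar per-part scores are all assumptions made for illustration; the abstract does not specify how the prompts are worded or how the fusion weights are computed.

```python
import math

# Hypothetical part vocabulary; the abstract does not list the parts used.
PARTS = ("shaft", "wrist", "jaws")


def collaborative_prompts(category, parts=PARTS):
    """Combine a category-level text with part-level texts (illustrative wording)."""
    return [f"{part} of the {category}" for part in parts]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_parts(part_feats, scores):
    """Fuse P part-level feature vectors (each of length D) into one vector.

    Assumption: the adaptive weights are a softmax over one scalar score per
    part. In a real model the scores would be predicted by a learned layer.
    """
    weights = softmax(scores)
    dim = len(part_feats[0])
    whole = [
        sum(w * feat[d] for w, feat in zip(weights, part_feats))
        for d in range(dim)
    ]
    return whole, weights


# Usage: three one-hot "part features" with equal scores give uniform weights.
prompts = collaborative_prompts("bipolar forceps")
whole, weights = fuse_parts(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    [0.0, 0.0, 0.0],
)
```

The weighted sum keeps the fused vector in the same space as the part vectors, so a part with a higher score contributes more to the whole-instrument representation.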