Surgical instrument segmentation is crucial for surgical scene understanding and thereby facilitates surgical safety. Existing algorithms directly detect all instruments of pre-defined categories in the input image, lacking the capability to segment specific instruments according to the surgeon's intention. During different stages of surgery, surgeons exhibit varying preferences and focus on different surgical instruments. Therefore, an instrument segmentation algorithm that adheres to the surgeon's intention can minimize distractions from irrelevant instruments and substantially assist surgeons. The recent Segment Anything Model (SAM) demonstrates the capability to segment objects following prompts, but manual annotation of prompts is impractical during surgery. To address these limitations in operating rooms, we propose an audio-driven surgical instrument segmentation framework, named ASI-Seg, which accurately segments the required surgical instruments by parsing the audio commands of surgeons. Specifically, we propose an intention-oriented multimodal fusion to interpret the segmentation intention from audio commands and retrieve relevant instrument details to facilitate segmentation. Moreover, to guide our ASI-Seg to segment the required surgical instruments, we devise a contrastive learning prompt encoder that effectively distinguishes the required instruments from irrelevant ones. Therefore, our ASI-Seg streamlines the workflow in operating rooms, providing targeted support and reducing the cognitive load on surgeons. Extensive experiments validate the ASI-Seg framework, which reveals remarkable advantages over classical state-of-the-art methods and medical SAMs in both semantic segmentation and intention-oriented segmentation. The source code is available at https://github.com/Zonmgin-Zhang/ASI-Seg.
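The contrastive prompt encoding idea above can be illustrated with a minimal sketch. The snippet below is an assumption-laden toy version, not the paper's actual encoder: it uses an InfoNCE-style objective (with a hypothetical `intention_contrastive_loss` function and cosine-similarity logits) to pull an intention embedding toward embeddings of the required instruments and push it away from irrelevant ones.

```python
import numpy as np

def intention_contrastive_loss(query, positives, negatives, temperature=0.07):
    """Toy InfoNCE-style loss (illustrative only, not the ASI-Seg code).

    query:     intention embedding parsed from the audio command
    positives: embeddings of instruments the surgeon asked for
    negatives: embeddings of irrelevant instruments
    """
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Temperature-scaled cosine similarities to all candidates.
    pos = np.array([cos(query, p) for p in positives]) / temperature
    neg = np.array([cos(query, n) for n in negatives]) / temperature
    logits = np.concatenate([pos, neg])

    # Log-softmax over all candidates; the loss is the mean negative
    # log-probability assigned to the required (positive) instruments.
    log_probs = logits - np.log(np.exp(logits - logits.max()).sum()) - logits.max()
    return -log_probs[: len(pos)].mean()
```

Minimizing such a loss drives the intention embedding to be more similar to required-instrument features than to irrelevant ones, which is the separation the prompt encoder needs before prompting the segmentation decoder.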