SAM Meets Robotic Surgery: An Empirical Study on Generalization, Robustness and Adaptation

From arXiv. Accepted as an oral presentation at the MedAGI Workshop (MICCAI 2023, 1st International Workshop on Foundation Models for General Medical AI). arXiv admin note: substantial text overlap with arXiv:2304.14674

The Segment Anything Model (SAM) serves as a foundation model for semantic segmentation and demonstrates remarkable generalization capabilities across a wide range of downstream scenarios. In this empirical study, we examine SAM's robustness and zero-shot generalizability in the field of robotic surgery. We comprehensively explore different scenarios, including prompted and unprompted settings, bounding-box and point-based prompts, and generalization under corruptions and perturbations at five severity levels. Additionally, we compare the performance of SAM with state-of-the-art supervised models. We conduct all experiments on two well-known robotic instrument segmentation datasets from the MICCAI EndoVis 2017 and 2018 challenges. Our extensive evaluation reveals that although SAM shows remarkable zero-shot generalization ability with bounding-box prompts, it struggles to segment the whole instrument with point-based prompts and in unprompted settings. Furthermore, our qualitative figures demonstrate that the model either fails to predict certain parts of the instrument mask (e.g., jaws, wrist) or assigns parts of the instrument to the wrong class when instruments overlap within the same bounding box or when a point-based prompt is used. SAM also struggles to identify instruments in complex surgical scenarios characterized by blood, reflection, blur, and shade. Additionally, SAM is insufficiently robust to maintain high performance under various forms of data corruption. We also fine-tune SAM using Low-Rank Adaptation (LoRA) and propose SurgicalSAM, which is capable of class-wise mask prediction without prompts. We therefore argue that, without further domain-specific fine-tuning, SAM is not ready for downstream surgical tasks.
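
The low-rank idea behind LoRA, which the abstract names as the fine-tuning method, can be illustrated with a small sketch. This is not the authors' SurgicalSAM code: the class name LoRALinear, the helper matmul, and the values of rank and alpha are hypothetical and chosen only for illustration. The sketch assumes the standard LoRA formulation, in which a frozen weight W is augmented by a trainable update (alpha / rank) * B A, with B initialized to zero so that the adapted layer starts out identical to the pretrained one.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A.

    Hypothetical illustration of the LoRA idea, not SurgicalSAM itself.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # pretrained weight, kept frozen
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A gets small random values and B starts at zero, so the update
        # B @ A is zero before any training step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to one input vector x of length d_in."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def num_trainable(self):
        """Count only the entries of A and B; W is never updated."""
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)


layer = LoRALinear([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], rank=1)
print(layer.forward([1.0, 1.0, 1.0]), layer.num_trainable())
```

In this toy case the frozen matrix has 6 entries while the rank-1 update has 5 trainable ones. For the large projection matrices of an image encoder the gap is far wider, which is what makes this style of adaptation cheap relative to full fine-tuning.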
