Training segmentation models for medical images remains challenging because annotated data are scarce and expensive to acquire. The Segment Anything Model (SAM) is a foundation model trained on over 1 billion annotations, predominantly for natural images, that is designed to segment a user-defined object of interest interactively. Despite its impressive performance on natural images, it is unclear how the model is affected by the shift to medical imaging domains. Here, we perform an extensive evaluation of SAM's ability to segment medical images on a collection of 11 medical imaging datasets spanning various modalities and anatomies. In our experiments, we generated point prompts using a standard method that simulates interactive segmentation. Experimental results show that SAM's performance with a single prompt varies widely depending on the task and the dataset, with IoU ranging from 0.1135 for a spine MRI dataset to 0.8650 for a hip X-ray dataset. Performance appears to be high for tasks involving well-circumscribed objects with unambiguous prompts and poorer in many other scenarios, such as tumor segmentation. When multiple prompts are provided, performance improves only slightly overall, but more so for datasets where the object is not contiguous. An additional comparison to RITM showed much better performance for SAM with one prompt but similar performance of the two methods with a larger number of prompts. We conclude that SAM shows impressive performance on some datasets given the zero-shot learning setup but poor to moderate performance on multiple other datasets. While SAM as a model and as a learning paradigm might be impactful in the medical imaging domain, extensive research is needed to identify the proper ways of adapting it to this domain.
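The results above are reported as IoU (intersection over union) between a predicted mask and the ground-truth mask. As a minimal sketch of that metric, and not the evaluation code used in the study, the following pure-Python function computes IoU for two binary masks. The `iou` name, the nested-list mask representation, and the convention that two empty masks score 1.0 are assumptions made for illustration.

```python
from typing import Sequence

# Assumed representation: a 2D binary mask, 1 = object, 0 = background.
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    same_rows = len(pred) == len(target)
    if not same_rows or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1

    # Assumption: two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0


# Toy example: the prediction covers 3 of 4 object pixels and adds 1 extra pixel.
ground_truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
prediction = [
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
score = iou(prediction, ground_truth)  # 3 shared pixels / 5 pixels in the union
```

In this toy case the masks share 3 pixels and their union has 5, so the score is 0.6. Real evaluations would typically use an array library for speed; the nested loops here only keep the sketch dependency-free.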