Medical image segmentation allows quantifying the size and shape of target structures, aiding in disease diagnosis, prognosis, surgery planning, and comprehension. Building upon recent advancements in foundation Vision-Language Models (VLMs) trained on natural image-text pairs, several studies have proposed adapting them into Vision-Language Segmentation Models (VLSMs) that accept language text as an additional input to segmentation models. Introducing auxiliary information via text with human-in-the-loop prompting during inference opens up unique opportunities, such as open-vocabulary segmentation and potentially more robust segmentation models against out-of-distribution data. Although transfer learning from natural to medical images has been explored for image-only segmentation models, the joint vision-language representation remains underexplored for segmentation problems. This work presents the first systematic study on transferring VLSMs to 2D medical images, using $11$ carefully curated datasets encompassing diverse modalities, together with insightful language prompts and experiments. Our findings demonstrate that, although VLSMs show competitive performance compared to image-only models when finetuned on limited medical image datasets, not all VLSMs utilize the additional information from language prompts; image features play a dominant role. While VLSMs exhibit enhanced performance on pooled datasets with diverse modalities and show potential robustness to domain shifts compared to conventional segmentation models, our results suggest that novel approaches are required to enable VLSMs to leverage the various auxiliary information available through language prompts. The code and datasets are available at https://github.com/naamiinepal/medvlsm.