Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models

Medical image segmentation with deep learning is an important and widely studied topic because segmentation makes it possible to quantify the size and shape of target structures, which can help in disease diagnosis, prognosis, surgery planning, and understanding. Recent advances in foundation vision-language models (VLMs) and their adaptation to segmentation tasks in natural images as vision-language segmentation models (VLSMs) have opened up a unique opportunity to build potentially powerful segmentation models for medical images. Such models can accept helpful information through a language prompt as input, leverage the extensive range of other medical imaging datasets through pooled dataset training, adapt to new classes, and be robust against out-of-distribution data with human-in-the-loop prompting during inference. Although transfer learning from natural to medical images has been studied for image-only segmentation models, no studies have analyzed how the joint vision-language representation transfers to medical images in segmentation problems or identified the gaps in leveraging its full potential. We present the first benchmark study on transfer learning of VLSMs to 2D medical images, using 11 carefully collected existing 2D medical image datasets of diverse modalities and 9 carefully designed types of language prompts derived from 14 attributes. Our results indicate that VLSMs trained on natural image-text pairs transfer reasonably well to the medical domain in zero-shot settings when prompted appropriately for non-radiology photographic modalities; when finetuned, they obtain performance comparable to conventional architectures, even in X-ray and ultrasound modalities. However, the additional benefit of language prompts during finetuning may be limited, with image features playing a more dominant role. Even so, VLSMs can better handle training on pooled datasets that combine diverse modalities and are potentially more robust to domain shift than conventional segmentation models.
