Vision-Language Pretraining (VLP) has demonstrated remarkable capabilities in learning visual representations from textual descriptions of images without annotations. Yet effective VLP demands large-scale image-text pairs, a resource that is scarce in the medical domain. Moreover, conventional VLP is limited to 2D images, while medical images encompass diverse modalities, often in 3D, making the learning process more challenging. To address these challenges, we present Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation (GTGM), a framework that extends VLP to 3D medical images without relying on paired textual descriptions. Specifically, GTGM utilizes large language models (LLMs) to generate medical-style text from 3D medical images. This synthetic text is then used to supervise 3D visual representation learning. Furthermore, a negative-free contrastive learning objective is introduced to cultivate consistent visual representations between augmented 3D medical image patches, which effectively mitigates the biases associated with strict positive-negative sample pairings. We evaluate GTGM on three imaging modalities, Computed Tomography (CT), Magnetic Resonance Imaging (MRI), and electron microscopy (EM), over 13 datasets. GTGM's superior performance across various medical image segmentation tasks underscores its effectiveness and versatility: it extends VLP to 3D medical imagery while bypassing the need for paired text.
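To make the idea of a negative-free objective concrete, the following is a minimal sketch, not GTGM's actual loss, which the abstract does not specify. It assumes a simple alignment term, one minus the cosine similarity between embeddings of two augmented views of the same 3D patch, in the spirit of negative-free methods such as BYOL and SimSiam. The function names and the plain-list embedding format are hypothetical, chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def negative_free_loss(view_a, view_b):
    """Mean (1 - cosine similarity) over matched embedding pairs.

    Assumption: view_a[i] and view_b[i] are embeddings of two augmentations
    of the same patch. Only these positive pairs are used; no other sample
    in the batch is treated as a negative.
    """
    if not view_a or len(view_a) != len(view_b):
        raise ValueError("views must be non-empty and the same length")
    total = sum(1.0 - cosine_similarity(u, v) for u, v in zip(view_a, view_b))
    return total / len(view_a)


if __name__ == "__main__":
    # Hypothetical embeddings for two augmented views of two patches.
    view_a = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1]]
    view_b = [[0.9, 0.1, 0.4], [0.1, 1.0, 0.2]]
    print(negative_free_loss(view_a, view_b))
```

The loss is 0 when paired embeddings point in the same direction and grows as they diverge. An alignment term like this can be minimized trivially by mapping every input to the same vector, so practical negative-free methods add a mechanism against collapse (for example a stop-gradient or a redundancy-reduction term); the sketch omits this.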