Medical Vision-Language Pretraining (Med-VLP) establishes a connection between the visual content of medical images and the associated textual descriptions. Existing Med-VLP methods focus primarily on 2D images depicting a single body part, notably chest X-rays. In this paper, we extend the scope of Med-VLP to 3D images, specifically targeting full-body scenarios, by using a multimodal dataset of CT images and reports. Compared with its 2D counterpart, 3D VLP must capture essential semantics from the significantly sparser representations found in 3D imaging. We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning, aligning grounded visual features with precise diagnostic text. Additionally, we develop an abnormality dictionary to augment contrastive learning with diverse contrastive pairs. Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, can identify organs and abnormalities in a zero-shot manner using natural language. The performance of CT-GLIP is validated on a separate test set of 1,130 patients, focusing on the 16 most frequent abnormalities across 7 organs. The experimental results show that our model outperforms the standard CLIP framework in both zero-shot and fine-tuning scenarios, using both CNN and ViT architectures.
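
The organ-level contrastive alignment and zero-shot identification described above can be illustrated with a minimal sketch. It is not the CT-GLIP implementation: it assumes a generic CLIP-style symmetric InfoNCE loss over already-computed organ-level image and text embeddings, and every function name, the toy 2-dimensional embeddings, and the temperature of 0.07 are hypothetical choices made for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_logits(image_embs, text_embs, temperature=0.07):
    """Pairwise cosine similarities divided by a temperature (assumed value)."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[k]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image losses (CLIP-style)."""
    logits = cosine_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed)
    )


def zero_shot_predict(organ_emb, prompt_embs):
    """Return the index of the text prompt most similar to the organ feature."""
    scores = cosine_logits([organ_emb], prompt_embs, temperature=1.0)[0]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings: row k of the images is paired with row k of the texts.
organ_image_embs = [[1.0, 0.0], [0.0, 1.0]]
matched_text_embs = [[1.0, 0.0], [0.0, 1.0]]
shuffled_text_embs = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_contrastive_loss(organ_image_embs, matched_text_embs)
misaligned_loss = symmetric_contrastive_loss(organ_image_embs, shuffled_text_embs)
predicted_prompt = zero_shot_predict([1.0, 0.1], shuffled_text_embs)
```

In this sketch, correctly paired organ-level embeddings yield a lower loss than shuffled pairs, and zero-shot prediction reduces to choosing the text prompt with the highest cosine similarity to the organ feature. Under the same assumptions, an abnormality dictionary could be modeled by appending further text embeddings as additional candidate columns.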