Medical Vision-Language Pretraining (Med-VLP) links the visual content of medical images to the associated textual descriptions. Existing Med-VLP methods primarily focus on 2D images depicting a single body part, notably chest X-rays. In this paper, we extend Med-VLP to 3D images, specifically targeting full-body scenarios, using a multimodal dataset of CT images and reports. Compared with its 2D counterpart, 3D VLP must effectively capture essential semantics from the significantly sparser representation in 3D imaging. We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning, aligning grounded visual features with precise diagnostic text. Additionally, we develop an abnormality dictionary to augment contrastive learning with diverse contrastive pairs. Trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, our method can identify organs and abnormalities in a zero-shot manner using natural language. We validate CT-GLIP on a separate test set of 1,130 patients, focusing on the 16 most frequent abnormalities across 7 organs. The experimental results show that our model outperforms the standard CLIP framework in both zero-shot and fine-tuning scenarios, using both CNN and ViT architectures.
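To make the organ-level contrastive alignment concrete, the following is a minimal sketch, not the authors' implementation. It assumes a CLIP-style symmetric InfoNCE loss over paired organ-level visual features and text embeddings, followed by zero-shot prediction via cosine similarity. The function names, the temperature of 0.07, and the random arrays standing in for encoder outputs are all illustrative assumptions. The abnormality dictionary is not modeled here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def organ_text_contrastive_loss(organ_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss (assumed form, CLIP-style).

    Row i of organ_feats (an organ-level visual feature) is treated as the
    positive match for row i of text_feats (its diagnostic text embedding);
    every other row in the batch acts as a negative.
    """
    v = l2_normalize(organ_feats)
    t = l2_normalize(text_feats)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_v2t + loss_t2v)


def zero_shot_predict(organ_feat, prompt_feats):
    """Return the index of the text prompt most similar to one organ feature."""
    sims = l2_normalize(prompt_feats) @ l2_normalize(organ_feat)
    return int(np.argmax(sims))
```

In this sketch, zero-shot recognition reduces to embedding one natural-language prompt per candidate organ or abnormality and selecting the prompt whose embedding is closest to the organ-level visual feature.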