Vision-language pre-training (VLP) models have proven effective in many computer vision applications. In this paper, we develop a VLP model for the medical domain that makes computer-aided diagnoses (CAD) from image scans and text descriptions in electronic health records, as is done in clinical practice. To this end, we present MedBLIP, a lightweight CAD system and a new paradigm for bootstrapping VLP from off-the-shelf frozen pre-trained image encoders and frozen large language models. We design a MedQFormer module to bridge the gap between 3D medical images on one side and 2D pre-trained image encoders and language models on the other. To evaluate MedBLIP, we collect more than 30,000 image volumes from five public Alzheimer's disease (AD) datasets: ADNI, NACC, OASIS, AIBL, and MIRIAD. On this dataset, the largest AD dataset we are aware of, our model achieves state-of-the-art (SOTA) performance on zero-shot classification of healthy, mild cognitive impairment (MCI), and AD subjects, and demonstrates its capability for medical visual question answering (VQA). The code and pre-trained models are available online: https://github.com/Qybc/MedBLIP.