BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Andrea Tupini, Yu Wang, Matt Mazzola, Swadheen Shukla, Lars Liden, Jianfeng Gao, Matthew P. Lungren, Tristan Naumann, Sheng Wang, Hoifung Poon

Source: arXiv. The models are released at https://aka.ms/biomedclip

Biomedical data is inherently multimodal, comprising physical measurements and natural language narratives. A generalist biomedical AI model needs to simultaneously process different modalities of data, including text and images. Therefore, training an effective generalist biomedical model requires high-quality multimodal data, such as parallel image-text pairs. Here, we present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets such as MIMIC-CXR, and spans a diverse range of biomedical image types. PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles. Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing. We conducted extensive experiments and ablation studies on standard biomedical imaging tasks from retrieval to classification to visual question-answering (VQA). BiomedCLIP achieved new state-of-the-art results in a wide range of standard datasets, substantially outperforming prior approaches. Intriguingly, by large-scale pretraining on diverse biomedical image types, BiomedCLIP even outperforms state-of-the-art radiology-specific models such as BioViL in radiology-specific tasks such as RSNA pneumonia detection. In summary, BiomedCLIP is a fully open-access foundation model that achieves state-of-the-art performance on various biomedical tasks, paving the way for transformative multimodal biomedical discovery and applications. We release our models at https://aka.ms/biomedclip to facilitate future research in multimodal biomedical AI.
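
To make the retrieval task mentioned in the abstract concrete, here is a minimal sketch of how image-to-text retrieval is commonly done with a CLIP-style model: the image and each candidate caption are embedded into a shared vector space, and captions are ranked by cosine similarity to the image. The embeddings below are made-up toy vectors, and the function names (l2_normalize, cosine_similarity, rank_captions) are hypothetical helpers for illustration; none of this is the BiomedCLIP API or the authors' evaluation code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_captions(image_embedding, caption_embeddings):
    """Return (caption indices sorted from most to least similar, raw scores)."""
    scores = [cosine_similarity(image_embedding, c) for c in caption_embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy embeddings (assumed values, not produced by any real encoder).
image_embedding = [0.9, 0.1, 0.0]
caption_embeddings = [
    [0.0, 1.0, 0.0],  # hypothetical caption 0
    [1.0, 0.0, 0.0],  # hypothetical caption 1
    [0.0, 0.0, 1.0],  # hypothetical caption 2
]

order, scores = rank_captions(image_embedding, caption_embeddings)
print(order)  # → [1, 0, 2]
```

In a real setting the vectors would come from the released image and text encoders, and zero-shot classification follows the same pattern with one text prompt per class label in place of the captions.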
