Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing

Sheng Zhang, Yanbo Xu, Naoto Usuyama, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Matthew P. Lungren, Tristan Naumann, Hoifung Poon

From arXiv. The models will be released soon at https://aka.ms/biomedclip

Contrastive pretraining on parallel image-text data has attained great success in vision-language processing (VLP), as exemplified by CLIP and related methods. However, prior explorations tend to focus on general-domain data from the web. Biomedical images and text are rather different, but publicly available datasets are small and skew toward chest X-rays, thus severely limiting progress. In this paper, we conducted by far the largest study on biomedical VLP, using 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central. Our dataset (PMC-15M) is two orders of magnitude larger than existing biomedical image-text datasets such as MIMIC-CXR, and spans a diverse range of biomedical images. The standard CLIP method is suboptimal for the biomedical domain. We propose BiomedCLIP, with domain-specific adaptations tailored to biomedical VLP. We conducted extensive experiments and ablation studies on standard biomedical imaging tasks, from retrieval to classification to visual question answering (VQA). BiomedCLIP established a new state of the art on a wide range of standard datasets, substantially outperforming prior VLP approaches. Surprisingly, BiomedCLIP even outperformed radiology-specific state-of-the-art models such as BioViL on radiology-specific tasks such as RSNA pneumonia detection, thus highlighting the utility of large-scale pretraining across all biomedical image types. We will release our models at https://aka.ms/biomedclip to facilitate future research in biomedical VLP.
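
For readers unfamiliar with the contrastive objective mentioned above, the following is a minimal sketch of a generic CLIP-style symmetric loss over a batch of paired image and text embeddings. It is illustrative only and does not reproduce BiomedCLIP's domain-specific adaptations, which the abstract does not detail. The function names, the fixed temperature of 0.07, and the toy embeddings are assumptions made for this example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    Pair i (image_embs[i], text_embs[i]) is treated as the positive;
    every other pairing in the batch acts as a negative.
    The temperature value is an assumed constant, not taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
              for img in imgs]
    # Image-to-text direction: each row should put its mass on the diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch (hypothetical embeddings): matched pairs vs. swapped pairs.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is near zero when each image embedding matches its own caption embedding and large when the captions are swapped, which is the signal that drives paired image-text pretraining.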
