Vision-Language Models (VLMs) trained via contrastive learning have achieved notable success on natural image tasks. However, their application in the medical domain remains limited due to the scarcity of openly accessible, large-scale medical image-text datasets. Existing medical VLMs are trained either on closed-source proprietary datasets or on relatively small open-source datasets that do not generalize well. Moreover, most models remain specific to a single or a limited number of medical imaging domains, further restricting their applicability to other modalities. To address this gap, we introduce UniMed, a large-scale, open-source multi-modal medical dataset comprising over 5.3 million image-text pairs across six diverse imaging modalities: X-ray, CT, MRI, Ultrasound, Pathology, and Fundus. UniMed is built with a data-collection framework that leverages Large Language Models (LLMs) to transform modality-specific classification datasets into image-text format while also incorporating existing image-text data from the medical domain, enabling scalable VLM pretraining. Using UniMed, we trained UniMed-CLIP, a unified VLM for the six modalities that significantly outperforms existing generalist VLMs and matches modality-specific medical VLMs, achieving notable gains in zero-shot evaluations. For instance, UniMed-CLIP improves over BiomedCLIP (trained on proprietary data) by an absolute gain of +12.61, averaged over 21 datasets, while using 3x less training data. To facilitate future research, we release the UniMed dataset, training code, and models at https://github.com/mbzuai-oryx/UniMed-CLIP.
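The conversion of modality-specific classification datasets into image-text pairs can be illustrated with a minimal sketch. This is not the paper's released pipeline: the function names (`build_llm_prompt`, `label_to_caption`, `convert`) and the template caption are hypothetical, and the actual LLM call that UniMed would use to generate richer captions is stood in for by a simple string template.

```python
# Hypothetical sketch of turning one classification sample into an
# image-text pair suitable for contrastive (CLIP-style) pretraining.
# UniMed reportedly uses LLMs for caption generation; the LLM call is
# mocked here by a fixed template so the example is self-contained.

from dataclasses import dataclass


@dataclass
class ImageTextPair:
    image_path: str
    caption: str


def build_llm_prompt(modality: str, label: str) -> str:
    """Prompt an LLM could receive to rewrite a class label as a caption
    (illustrative only; the real prompt design is not specified here)."""
    return (
        f"Write a short clinical caption for a {modality} image "
        f"whose classification label is '{label}'."
    )


def label_to_caption(modality: str, label: str) -> str:
    """Template fallback standing in for an LLM-generated caption."""
    return f"A {modality} image showing {label}."


def convert(sample: dict) -> ImageTextPair:
    """Map {image_path, modality, label} to an image-text pair."""
    caption = label_to_caption(sample["modality"], sample["label"])
    return ImageTextPair(sample["image_path"], caption)


pair = convert(
    {"image_path": "xray_001.png",
     "modality": "chest X-ray",
     "label": "pneumonia"}
)
print(pair.caption)  # A chest X-ray image showing pneumonia.
```

In a real pipeline, the prompt from `build_llm_prompt` would be sent to an LLM and its response used as the caption, yielding more varied and descriptive text than a fixed template.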