Vision-Language Models (VLMs) trained via contrastive learning have achieved notable success on natural image tasks. However, their application in the medical domain remains limited due to the scarcity of openly accessible, large-scale medical image-text datasets. Existing medical VLMs are trained either on closed-source proprietary datasets or on relatively small open-source datasets that do not generalize well. Moreover, most models remain specific to a single or a limited number of medical imaging domains, further restricting their applicability to other modalities. To address this gap, we introduce UniMed, a large-scale, open-source multi-modal medical dataset comprising over 5.3 million image-text pairs across six diverse imaging modalities: X-ray, CT, MRI, Ultrasound, Pathology, and Fundus. UniMed is built with a data-collection framework that leverages Large Language Models (LLMs) to transform modality-specific classification datasets into image-text format while also incorporating existing image-text data from the medical domain, enabling scalable VLM pretraining. Using UniMed, we trained UniMed-CLIP, a unified VLM for the six modalities that significantly outperforms existing generalist VLMs and matches modality-specific medical VLMs, achieving notable gains in zero-shot evaluations. For instance, UniMed-CLIP improves over BiomedCLIP (trained on proprietary data) by an absolute gain of +12.61, averaged over 21 datasets, while using 3x less training data. To facilitate future research, we release the UniMed dataset, training code, and models at https://github.com/mbzuai-oryx/UniMed-CLIP.
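The conversion of modality-specific classification datasets into image-text pairs can be illustrated with a minimal sketch. This is not the paper's released pipeline: the function names (`build_llm_prompt`, `label_to_caption`, `convert`) and the template caption are hypothetical, and the actual LLM call that UniMed would use to generate richer captions is stood in for by a simple string template.

```python
# Hypothetical sketch of turning one classification sample into an
# image-text pair suitable for contrastive (CLIP-style) pretraining.
# UniMed reportedly uses LLMs for caption generation; the LLM call is
# mocked here by a fixed template so the example is self-contained.

from dataclasses import dataclass


@dataclass
class ImageTextPair:
    image_path: str
    caption: str


def build_llm_prompt(modality: str, label: str) -> str:
    """Prompt an LLM could receive to rewrite a class label as a caption
    (illustrative only; the real prompt design is not specified here)."""
    return (
        f"Write a short clinical caption for a {modality} image "
        f"whose classification label is '{label}'."
    )


def label_to_caption(modality: str, label: str) -> str:
    """Template fallback standing in for an LLM-generated caption."""
    return f"A {modality} image showing {label}."


def convert(sample: dict) -> ImageTextPair:
    """Map {image_path, modality, label} to an image-text pair."""
    caption = label_to_caption(sample["modality"], sample["label"])
    return ImageTextPair(sample["image_path"], caption)


pair = convert(
    {"image_path": "xray_001.png",
     "modality": "chest X-ray",
     "label": "pneumonia"}
)
print(pair.caption)  # A chest X-ray image showing pneumonia.
```

In a real pipeline, the prompt from `build_llm_prompt` would be sent to an LLM and its response used as the caption, yielding more varied and descriptive text than a fixed template.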