Medical vision-language pretraining models (VLPMs) have achieved remarkable progress in fusing chest X-rays (CXR) with clinical text, introducing image-text binding approaches that enable zero-shot learning and downstream clinical tasks. However, current approaches do not integrate additional medical modalities, such as electrocardiograms (ECG). We present MEDBind (Medical Electronic patient recorD), which learns joint embeddings across CXR, ECG, and medical text. Using text as the central anchor, MEDBind performs tri-modality binding, delivering competitive performance against established VLPMs on top-K retrieval, zero-shot, and few-shot benchmarks, and additionally enabling CXR-to-ECG zero-shot classification and retrieval. This integration is achieved by combining a contrastive loss on modality-text pairs with our proposed Edge-Modality Contrastive Loss, which fosters a cohesive embedding space for CXR, ECG, and text. Finally, we demonstrate that MEDBind can improve downstream tasks by directly integrating CXR and ECG embeddings into a large language model for multimodal prompt tuning.
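
As a rough illustration of the training objective described above, the following sketch combines a symmetric InfoNCE loss on modality-text pairs with an additional contrastive term between CXR and ECG embeddings. This is not the authors' implementation: the exact form of the Edge-Modality Contrastive Loss is not given in the abstract, so the CXR-ECG term, the names `info_nce` and `tri_modal_loss`, the temperature of 0.07, and the `edge_weight` parameter are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (leaves an all-zero vector unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings, where a[i] matches b[i].

    The temperature value is an assumed default, not taken from the paper.
    """
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    n = len(a)
    # Cosine-similarity logits between every a[i] and every b[j].
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def mean_cross_entropy(rows):
        # Cross-entropy with the matching pair (the diagonal entry) as the target.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(columns))


def tri_modal_loss(cxr, cxr_text, ecg, ecg_text, paired_cxr, paired_ecg,
                   edge_weight=1.0):
    """Hypothetical combined objective: two text-anchored terms plus a CXR-ECG term.

    paired_cxr[i] and paired_ecg[i] are assumed to come from the same patient;
    how such pairs are actually formed is not specified in the abstract.
    """
    text_anchored = info_nce(cxr, cxr_text) + info_nce(ecg, ecg_text)
    edge_term = info_nce(paired_cxr, paired_ecg)
    return text_anchored + edge_weight * edge_term
```

In this sketch, text acts as the anchor through the two modality-text terms, while the extra CXR-ECG term directly pulls the two non-text modalities together; correctly aligned batches yield a lower loss than mismatched ones.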