We perform a comprehensive benchmarking of contrastive frameworks for learning multimodal representations in the medical domain. Through this study, we aim to answer the following research questions: (i) How transferable are general-domain representations to the medical domain? (ii) Is multimodal contrastive training sufficient, or does it also benefit from unimodal training? (iii) What is the impact of feature granularity on the effectiveness of multimodal medical representation learning? To answer these questions, we investigate eight contrastive learning approaches under identical training setups, training them on 2.8 million image-text pairs from four datasets and evaluating them on 25 downstream tasks, including classification (zero-shot and linear probing), image-to-text and text-to-image retrieval, and visual question answering. Our findings indicate that general-domain representations transfer well to the medical domain (i), that multimodal contrastive training alone is not sufficient and benefits from additional unimodal training (ii), and that learning fine-grained features is beneficial (iii). Finally, we make our code publicly available.