Medical Vision Language Pretraining (VLP) has recently emerged as a promising solution to the scarcity of labeled data in the medical domain. By leveraging paired or unpaired vision and text datasets through self-supervised learning, models can acquire broad knowledge and learn robust feature representations. Such pretrained models have the potential to improve multiple downstream medical tasks simultaneously, reducing the dependency on labeled data. However, despite recent progress and this potential, no comprehensive survey has yet explored the various aspects and advancements in medical VLP. In this paper, we review existing works through the lens of pretraining objectives, architectures, downstream evaluation tasks, and the datasets used for pretraining and downstream tasks. We then examine current challenges in medical VLP, discuss existing and potential solutions, and conclude by highlighting future directions. To the best of our knowledge, this is the first survey focused on medical VLP.