Large-scale vision-language models have advanced rapidly and shown remarkable capabilities across a wide range of tasks. However, the lack of large, high-quality image-text datasets in medicine has greatly hindered the development of large-scale medical vision-language models. In this work, we present a diagnosis-guided bootstrapping strategy that exploits both image and label information to construct vision-language datasets. Using the constructed dataset, we develop MedDr, a generalist foundation model for healthcare that handles diverse medical data modalities, including radiology, pathology, dermatology, retinography, and endoscopy. Moreover, at inference time, we propose a simple but effective retrieval-augmented medical diagnosis strategy that enhances the model's generalization ability. Extensive experiments on visual question answering, medical report generation, and medical image diagnosis demonstrate the superiority of our method.