Multimodal large language models (MLLMs) such as GPT-4V have advanced rapidly. However, their medical multimodal capabilities remain limited because medical vision-text data is limited in both quantity and quality, a consequence of data privacy concerns and high annotation costs. Pioneering approaches address this by using PubMed's large-scale, de-identified medical image-text pairs, but they still fall short because of inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed an MLLM (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, creating the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision significantly enhances the medical multimodal capabilities of current MLLMs, with marked improvements on benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results confirm that our dataset is of higher quality than data produced by other construction methods. Using PubMedVision, we train a 34B medical MLLM, HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.