The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advances. However, these models still underperform in medical multimodal tasks because medical vision-text data are limited in both quantity and quality, owing to data privacy concerns and high annotation costs. Pioneering approaches draw on PubMed's large-scale, de-identified medical image-text pairs to address these limitations, but they still fall short because of inherent noise in the data. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, producing the PubMedVision dataset of 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision significantly enhances the medical multimodal capabilities of current MLLMs, yielding marked improvements on benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results confirm the superior quality of our dataset compared to other data construction methods. Using PubMedVision, we train HuatuoGPT-Vision, a 34B-parameter medical MLLM that achieves superior performance in medical multimodal scenarios among open-source MLLMs.
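To make the denoise-and-reformat step concrete, below is a minimal sketch of how a single PubMed image-caption pair could be turned into a VQA sample with a vision-capable MLLM. It assumes the official `openai` Python client; the prompt wording, model name, and JSON schema are illustrative assumptions, not the paper's exact pipeline.

```python
import base64
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt: the paper's actual instructions to GPT-4V may differ.
REFORMAT_PROMPT = (
    "You are given a medical image and its original PubMed caption. "
    "Correct obvious caption noise, then rewrite the pair as one visual "
    "question answering (VQA) sample. Return JSON with the keys "
    "'question' and 'answer', both grounded in what the image shows.\n\n"
    "Caption: {caption}"
)


def caption_to_vqa(image_path: str, caption: str, model: str = "gpt-4o") -> dict:
    """Denoise and reformat one image-caption pair into a VQA sample.

    The model is shown the image itself ('unblinded'), so its rewrite can
    be grounded in the visual content rather than the caption alone.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": REFORMAT_PROMPT.format(caption=caption)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)


# Hypothetical usage on one refined PubMed pair:
# vqa = caption_to_vqa("figure1.jpg", "Axial chest CT showing ...")
# -> {"question": "...", "answer": "..."}
```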