LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions about biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4-generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instructions to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, LLaVA-Med outperforms the previous supervised state of the art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.
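
The two-stage curriculum described above can be pictured with a minimal Python sketch. It only illustrates the ordering of the stages (figure-caption alignment first, GPT-4-generated instruction data second). The names `Stage`, `CURRICULUM`, `run_curriculum`, and `fake_train_stage` are hypothetical, the string "checkpoint" is a placeholder for a real model, and no actual fine-tuning is performed; this is not the authors' training code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str         # illustrative label, not taken from the paper
    data_source: str  # which dataset the stage consumes
    goal: str         # what the abstract says the model learns here


# Stage order follows the abstract; field values are illustrative labels.
CURRICULUM: List[Stage] = [
    Stage("concept_alignment", "pmc_figure_caption_pairs",
          "align biomedical vocabulary"),
    Stage("instruction_tuning", "gpt4_instruction_following_data",
          "learn open-ended conversational semantics"),
]


def run_curriculum(
    stages: List[Stage],
    train_stage: Callable[[str, Stage], str],
    initial_state: str = "general_domain_vlm",
) -> Tuple[str, List[str]]:
    """Apply stages strictly in order, feeding each stage's output to the next."""
    state = initial_state
    completed: List[str] = []
    for stage in stages:
        state = train_stage(state, stage)
        completed.append(stage.name)
    return state, completed


def fake_train_stage(state: str, stage: Stage) -> str:
    # Stand-in for real fine-tuning: only records which stage was applied.
    return f"{state}+{stage.name}"


final_state, order = run_curriculum(CURRICULUM, fake_train_stage)
print(final_state)  # general_domain_vlm+concept_alignment+instruction_tuning
print(order)        # ['concept_alignment', 'instruction_tuning']
```

In a real setup, `fake_train_stage` would be replaced by a function that fine-tunes the model on the stage's data; the point of the sketch is only that the second stage starts from the output of the first.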
