Medicine, by its nature, is a multifaceted domain that requires the synthesis of information across various modalities. Medical generative vision-language models (VLMs) take a first step in this direction and promise many exciting clinical applications. However, existing models typically have to be fine-tuned on sizeable downstream datasets, which is a significant limitation because data is scarce in many medical applications, necessitating models that can learn from few examples in real time. Here we propose Med-Flamingo, a multimodal few-shot learner adapted to the medical domain. Based on OpenFlamingo-9B, we continue pre-training on paired and interleaved medical image-text data from publications and textbooks. Med-Flamingo unlocks few-shot generative medical visual question answering (VQA) abilities, which we evaluate on several datasets, including a novel, challenging open-ended VQA dataset of visual USMLE-style problems. Furthermore, we conduct the first human evaluation for generative medical VQA, in which physicians review the problems and blinded generations in an interactive app. Med-Flamingo improves performance in generative medical VQA by up to 20% in clinicians' ratings and, for the first time, enables multimodal medical few-shot adaptations such as rationale generation. We release our model, code, and evaluation app at https://github.com/snap-stanford/med-flamingo.