The scaling laws and extraordinary performance of large foundation models motivate the development and utilization of such models in biomedicine. However, despite early promising results on some biomedical benchmarks, major challenges remain before these models can be used in real-world applications. Frontier models such as GPT-4V still have major competency gaps in multimodal capabilities for biomedical applications. Moreover, pragmatic issues such as access, cost, latency, and compliance make it hard for clinicians to use privately hosted state-of-the-art large models directly on private patient data. In this paper, we explore training open-source small multimodal models (SMMs) to bridge biomedical competency gaps for unmet clinical needs. To maximize data efficiency, we adopt a modular approach: we incorporate state-of-the-art pre-trained models for the image and text modalities and focus on training a lightweight adapter that grounds each modality to the text embedding space. We conduct a comprehensive study of this approach on radiology imaging. For training, we assemble a large dataset of over 1 million image-text pairs. For evaluation, we propose a novel, clinically driven approach using GPT-4 and demonstrate its parity with expert evaluation. We also study grounding qualitatively using attention. To establish best practices, we conduct a systematic ablation study of choices in data engineering and multimodal training. The resulting LLaVA-Rad (7B) model attains state-of-the-art results on radiology tasks such as report generation and cross-modal retrieval, even outperforming much larger models such as GPT-4V and Med-PaLM M (84B). LLaVA-Rad is fast and can run on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
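The modular design described above — frozen pre-trained image and text models joined by a lightweight adapter that projects image features into the text embedding space — can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the dimensions, the two-layer MLP shape, and the ReLU nonlinearity are assumptions (LLaVA-style adapters are typically small MLPs over patch features).

```python
import numpy as np

# Illustrative sketch of a LLaVA-style lightweight adapter.
# Assumptions: feature sizes are placeholders, not the paper's values;
# a two-layer MLP with ReLU stands in for the actual projector.
rng = np.random.default_rng(0)

IMG_DIM = 1024   # hypothetical frozen image-encoder feature size
TXT_DIM = 4096   # hypothetical LLM text-embedding size

# Adapter parameters -- the only weights that would be trained;
# the image encoder and language model stay frozen.
W1 = rng.standard_normal((IMG_DIM, TXT_DIM)) * 0.02
b1 = np.zeros(TXT_DIM)
W2 = rng.standard_normal((TXT_DIM, TXT_DIM)) * 0.02
b2 = np.zeros(TXT_DIM)

def adapt(image_features: np.ndarray) -> np.ndarray:
    """Project image-patch features into the text embedding space,
    so they can be consumed as soft tokens by the language model."""
    h = np.maximum(image_features @ W1 + b1, 0.0)  # ReLU (assumption)
    return h @ W2 + b2

# A batch of 196 patch features, as produced by a frozen image encoder.
patches = rng.standard_normal((196, IMG_DIM))
tokens = adapt(patches)
print(tokens.shape)  # (196, 4096)
```

Because only the adapter's parameters are updated, training touches a tiny fraction of the total weights, which is what makes the approach data- and compute-efficient.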