The scaling laws and extraordinary performance of large foundation models motivate the development and utilization of such models in biomedicine. However, despite early promising results on some biomedical benchmarks, major challenges remain before these models can be used in real-world applications. Frontier models such as GPT-4V still have major competency gaps in multimodal capabilities for biomedical applications. Moreover, pragmatic issues such as access, cost, latency, and compliance make it hard for clinicians to use privately hosted state-of-the-art large models directly on private patient data. In this paper, we explore training open-source small multimodal models (SMMs) to bridge biomedical competency gaps for unmet clinical needs. To maximize data efficiency, we adopt a modular approach: we incorporate state-of-the-art pre-trained models for the image and text modalities and focus on training a lightweight adapter that grounds each modality to the text embedding space. We conduct a comprehensive study of this approach on radiology imaging. For training, we assemble a large dataset of over 1 million image-text pairs. For evaluation, we propose a novel, clinically driven approach using GPT-4 and demonstrate its parity with expert evaluation. We also study grounding qualitatively using attention. To establish best practices, we conduct a systematic ablation study of choices in data engineering and multimodal training. The resulting LLaVA-Rad (7B) model attains state-of-the-art results on radiology tasks such as report generation and cross-modal retrieval, even outperforming much larger models such as GPT-4V and Med-PaLM M (84B). LLaVA-Rad is fast and can run on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
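The modular design described above — frozen pre-trained image and text models joined by a lightweight adapter that projects image features into the text embedding space — can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the dimensions, the two-layer MLP shape, and the ReLU nonlinearity are assumptions (LLaVA-style adapters are typically small MLPs over patch features).

```python
import numpy as np

# Illustrative sketch of a LLaVA-style lightweight adapter.
# Assumptions: feature sizes are placeholders, not the paper's values;
# a two-layer MLP with ReLU stands in for the actual projector.
rng = np.random.default_rng(0)

IMG_DIM = 1024   # hypothetical frozen image-encoder feature size
TXT_DIM = 4096   # hypothetical LLM text-embedding size

# Adapter parameters -- the only weights that would be trained;
# the image encoder and language model stay frozen.
W1 = rng.standard_normal((IMG_DIM, TXT_DIM)) * 0.02
b1 = np.zeros(TXT_DIM)
W2 = rng.standard_normal((TXT_DIM, TXT_DIM)) * 0.02
b2 = np.zeros(TXT_DIM)

def adapt(image_features: np.ndarray) -> np.ndarray:
    """Project image-patch features into the text embedding space,
    so they can be consumed as soft tokens by the language model."""
    h = np.maximum(image_features @ W1 + b1, 0.0)  # ReLU (assumption)
    return h @ W2 + b2

# A batch of 196 patch features, as produced by a frozen image encoder.
patches = rng.standard_normal((196, IMG_DIM))
tokens = adapt(patches)
print(tokens.shape)  # (196, 4096)
```

Because only the adapter's parameters are updated, training touches a tiny fraction of the total weights, which is what makes the approach data- and compute-efficient.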