Recent advances in mixed-modal generative models have enabled flexible integration of information across image and text content. These models open new avenues for developing unified biomedical assistants capable of analyzing biomedical images, answering complex questions about them, and predicting the impact of medical procedures on a patient's health. However, existing resources face challenges such as limited data availability, narrow domain coverage, and restricted sources (e.g., medical papers). To address these gaps, we present MedMax, the first large-scale multimodal biomedical instruction-tuning dataset for mixed-modal foundation models. With 1.47 million instances, MedMax encompasses a diverse range of tasks, including multimodal content generation (interleaved image-text data), biomedical image captioning and generation, visual chat, and report understanding, spanning diverse medical domains such as radiology and histopathology. We then fine-tune a mixed-modal foundation model on MedMax, achieving significant performance gains: a 26% improvement over the Chameleon model and an 18.3% improvement over GPT-4o across 12 downstream biomedical visual question-answering tasks. Additionally, we introduce a unified evaluation suite for biomedical tasks, providing a robust framework to guide the development of next-generation mixed-modal biomedical AI assistants.