Current medical artificial intelligence systems are often limited to narrow applications, which hinders their widespread adoption in clinical practice. To address this limitation, we propose MedVersa, a generalist learner that enables flexible learning and tasking for medical image interpretation. By leveraging a large language model as a learnable orchestrator, MedVersa can learn from both visual and linguistic supervision, support multimodal inputs, and accept task specification in real time. This versatility allows MedVersa to adapt to various clinical scenarios and perform multifaceted medical image analysis. To support the development of MedVersa, we introduce MedInterp, the largest multimodal dataset to date for medical image interpretation, consisting of over 13 million annotated instances spanning 11 tasks across 3 modalities. Our experiments demonstrate that MedVersa achieves state-of-the-art performance in 9 tasks, sometimes outperforming specialist counterparts by over 10%. MedVersa is the first to showcase the viability of multimodal generative medical AI in implementing multimodal inputs, multimodal outputs, and dynamic task specification, highlighting its potential as a multifunctional system for comprehensive medical image analysis. This generalist approach to medical image interpretation paves the way for more adaptable and efficient AI-assisted clinical decision-making.