Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity. We introduce MEDSYN, a multilingual, multimodal benchmark of highly complex clinical cases with up to seven distinct types of visual clinical evidence (CE) per case. Mirroring clinical workflow, we evaluate 18 MLLMs on differential diagnosis (DDx) generation and final diagnosis (FDx) selection. While top models often match or even outperform human experts on DDx generation, all MLLMs exhibit a much larger DDx--FDx performance gap than expert clinicians, indicating a failure mode in synthesizing heterogeneous CE types. Ablations attribute this failure to (i) overreliance on less discriminative textual CE (\emph{e.g.}, medical history) and (ii) a cross-modal CE utilization gap. We introduce Evidence Sensitivity to quantify the latter and show that a smaller gap correlates with higher diagnostic accuracy. Finally, we demonstrate how Evidence Sensitivity can be used to guide interventions that improve model performance. We will open-source our benchmark and code.