Brain imaging analysis is crucial for diagnosing and treating brain disorders, and multimodal large language models (MLLMs) are increasingly supporting it. However, current brain imaging visual question-answering (VQA) benchmarks either cover a limited number of imaging modalities or are restricted to coarse-grained pathological descriptions, hindering a comprehensive assessment of MLLMs across the full clinical continuum. To address these limitations, we introduce OmniBrainBench, the first comprehensive multimodal VQA benchmark specifically designed to assess the multimodal comprehension capabilities of MLLMs in brain imaging analysis through both closed- and open-ended evaluations. OmniBrainBench comprises 15 distinct brain imaging modalities collected from 30 verified medical sources, yielding 9,527 validated VQA pairs and 31,706 images. It simulates clinical workflows and encompasses 15 multi-stage clinical tasks rigorously validated by a professional radiologist. Evaluations of 24 state-of-the-art models, including open-source general-purpose, medical, and proprietary MLLMs, highlight the substantial challenges posed by OmniBrainBench. Experiments reveal that proprietary MLLMs such as GPT-5 (63.37%) outperform the others yet lag far behind physicians (91.35%), while medical MLLMs show wide variance between closed- and open-ended VQA. Open-source general-purpose MLLMs generally trail but excel in specific tasks, and all models fall short in complex preoperative reasoning, revealing a critical visual-to-clinical gap. OmniBrainBench establishes a new standard for assessing MLLMs in brain imaging analysis and highlights the gaps between current models and physicians. We publicly release our benchmark at link.