Multiparametric 3D brain MRI (mpMRI) is central to neuroradiology, but characterizing tumor location, appearance, size, and involvement of critical structures for neurosurgical planning remains challenging. We introduce mpLLM, a multimodal LLM for visual question answering (VQA) on mpMRI that produces clinically interpretable tumor descriptors (e.g., volume, morphology, extent, and coarse localization) as an adjunct to clinical expertise for referring neurosurgeons. mpLLM uses a prompt-conditioned hierarchical mixture-of-experts (MoE) to fuse multiple 3D sequences, routing over modality- and token-level projection experts, which enables data-efficient end-to-end training without large-scale image-report pretraining. To address limited paired image-text supervision, we propose a synthetic VQA protocol that derives clinically grounded questions and answers from expert segmentation annotations, validated in collaboration with radiologists. Across multiple mpMRI datasets, mpLLM improves over strong medical VLM baselines by +5.5 points on average (+9.1% relative) and increases radiologist-rated clinical acceptability by +15.9 points (+46.6% relative). Our study makes three main contributions: (1) the first VQA dataset for 3D brain mpMRI, (2) a hierarchical MoE architecture for joint reasoning over interrelated 3D sequences, and (3) expert-supported evidence of clinical utility. Source code is available at https://github.com/arvindmvepa/mpllm, and we will release the dataset upon publication.
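The prompt-conditioned hierarchical MoE fusion described above can be illustrated with a minimal NumPy sketch. Everything here is hypothetical, for intuition only: the dimensions, the random linear "experts," and the gate parameterizations are stand-ins for the learned components of the actual model, which are not specified in the abstract. The sketch shows the two routing levels the abstract names: a modality-level gate that weights each 3D sequence, and a token-level gate that mixes projection experts per visual token.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: 4 mpMRI sequences (e.g., T1, T1c, T2, FLAIR), each
# encoded as n_tok visual tokens of width d_vis, projected into an LLM
# embedding space of width d_llm, with a pooled prompt embedding of width d_prompt.
n_seq, n_tok, d_vis, d_llm, d_prompt = 4, 16, 32, 64, 24

# One projection expert per modality (illustrative random matrices;
# real experts would be learned projection networks).
experts = [rng.standard_normal((d_vis, d_llm)) * 0.05 for _ in range(n_seq)]

# Prompt-conditioned routers (also illustrative).
W_modality = rng.standard_normal((d_prompt, n_seq)) * 0.05          # modality-level gate
W_token = rng.standard_normal((d_prompt + d_vis, n_seq)) * 0.05     # token-level gate

def fuse(vis_tokens, prompt_emb):
    """Fuse per-sequence tokens into LLM-space tokens.

    vis_tokens: (n_seq, n_tok, d_vis); prompt_emb: (d_prompt,) -> (n_tok, d_llm)
    """
    # Modality-level routing: one prompt-dependent weight per 3D sequence.
    g_mod = softmax(prompt_emb @ W_modality)                         # (n_seq,)
    fused = np.zeros((n_tok, d_llm))
    for m in range(n_seq):
        toks = vis_tokens[m]                                         # (n_tok, d_vis)
        # Token-level routing: gate each token over the projection experts,
        # conditioned on both the prompt and the token itself.
        feat = np.concatenate(
            [np.broadcast_to(prompt_emb, (n_tok, d_prompt)), toks], axis=-1)
        g_tok = softmax(feat @ W_token, axis=-1)                     # (n_tok, n_seq)
        proj = np.stack([toks @ E for E in experts], axis=1)         # (n_tok, n_seq, d_llm)
        mixed = (g_tok[..., None] * proj).sum(axis=1)                # (n_tok, d_llm)
        fused += g_mod[m] * mixed
    return fused
```

Because both gates depend on the prompt embedding, the same volumes can be fused differently for a volume question than for a localization question, which is the intuition behind routing rather than a fixed concatenation of sequences.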