Recent general-purpose and domain-specific multimodal large language models (LLMs) have made remarkable progress in medical decision-making. However, they are designed for specific classification or generative tasks, and they require training or fine-tuning large numbers of parameters on large-scale datasets at substantial computational cost, which hinders their clinical utility in diverse resource-constrained scenarios. In this paper, we propose Med-MoE (Mixture-of-Experts), a novel and lightweight framework that tackles both discriminative and generative multimodal medical tasks. The learning of Med-MoE consists of three steps: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning. After aligning multimodal medical images with LLM tokens, we enable the model to perform different multimodal medical tasks through instruction tuning, together with a trainable router tailored for expert selection across input modalities. Finally, the model is tuned by integrating the router with multiple domain-specific experts, which are selectively activated and further empowered by a meta expert. Comprehensive experiments on both open- and closed-ended medical visual question answering (Med-VQA) and image classification tasks across datasets such as VQA-RAD, SLAKE and Path-VQA demonstrate that our model achieves performance superior to or on par with state-of-the-art baselines, while requiring only approximately 30%-50% of activated model parameters. Extensive analysis and ablations corroborate the effectiveness and practical utility of our method.
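To make the routing idea concrete, the following is a minimal sketch of a sparsely activated layer in which a router picks the top-k domain experts and a meta expert is always applied. It is an illustration under stated assumptions, not the Med-MoE implementation: the abstract does not specify the router form, the value of k, or how the meta expert is combined, so the linear router, top-k selection with renormalized weights, and additive always-on meta expert are all assumptions, and every name here (`moe_forward`, `make_scaling_expert`, `router_weights`) is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def make_scaling_expert(scale):
    """Toy stand-in for an expert network: scales its input (assumption)."""
    def expert(x):
        return [scale * v for v in x]
    return expert


def moe_forward(x, router_weights, experts, meta_expert, k=1):
    """Route x to the top-k experts and add an always-active meta expert.

    Returns (output, selected_indices, gate_weights).
    """
    probs = softmax(linear(router_weights, x))
    # Keep the k experts with the highest router probability.
    selected = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in selected)
    gates = [probs[i] / norm for i in selected]

    # Assumption: the meta expert contributes additively to every input.
    output = meta_expert(x)
    for gate, idx in zip(gates, selected):
        expert_out = experts[idx](x)  # only selected experts are evaluated
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, selected, gates


# Toy configuration: three domain experts and one meta expert.
experts = [make_scaling_expert(s) for s in (10.0, 20.0, 30.0)]
meta_expert = make_scaling_expert(1.0)
router_weights = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]

output, selected, gates = moe_forward(
    [1.0, 0.0], router_weights, experts, meta_expert, k=1
)
print(selected, gates, output)
```

Because only the selected experts are evaluated, the per-input compute scales with k rather than with the total number of experts, which is the general mechanism behind activating only a fraction of a model's parameters.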