The deployment of large language models (LLMs) in real-world clinical applications is constrained by a fundamental trade-off between the accuracy of quadratic-cost attention architectures and the efficiency of linear-time models. To address this, we propose an LLM-based MambaFormer hybrid Mixture-of-Experts (MoE) framework for efficient medical question-answering (QA) and clinical assistance. MambaFormer employs a lightweight gating mechanism that performs token-level dynamic routing, dispatching short, complex queries to a customized Transformer expert (ET5) and long, high-throughput sequences to a State Space Model expert (EMamba). The customized EMamba and ET5 models are tailored to the input sequence dimensionality, embedding structure, sequence length, and target-specific output heads, and are fine-tuned through transfer learning on a new, custom-designed DentalQA dataset. Moreover, routing decisions are driven by the contextual complexity of token embeddings, the normalized sequence length, and domain-aware features, yielding a Pareto-optimal trade-off between inference latency and prediction accuracy. Furthermore, a novel utility-guided multi-objective loss jointly optimizes router parameters, routing behavior, expert utilization, and computational cost by adaptively regulating token-level expert activation. Finally, the proposed MambaFormer is evaluated with holdout validation on medical QA using the custom-designed DentalQA dataset and the public PubMedQA benchmark, and compared with state-of-the-art techniques. MambaFormer achieves the best BERTScore (0.9180) at ultra-low latency (0.077 s), delivering a 24.4× speedup over T5-Large and establishing a scalable solution for resource-constrained clinical deployment.
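To make the gating mechanism concrete, the following is a minimal PyTorch sketch of a token-level router over the two experts, assuming the routing features (token embedding, normalized sequence length, domain-aware features) are simply concatenated before a linear gate. The class name TokenRouter, the feature dimensions, and the concatenation scheme are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class TokenRouter(nn.Module):
    """Lightweight gate scoring each token for the Transformer expert (ET5)
    vs. the State Space Model expert (EMamba)."""

    def __init__(self, d_model: int, n_domain_features: int = 4):
        super().__init__()
        # Routing features assumed concatenated per token: the token
        # embedding, one normalized-length scalar, and domain-aware features.
        self.gate = nn.Linear(d_model + 1 + n_domain_features, 2)

    def forward(self, token_emb, seq_len, max_len, domain_feats):
        # token_emb: (batch, seq, d_model); domain_feats: (batch, seq, n_domain_features)
        b, s, _ = token_emb.shape
        norm_len = torch.full((b, s, 1), seq_len / max_len,
                              device=token_emb.device)
        feats = torch.cat([token_emb, norm_len, domain_feats], dim=-1)
        logits = self.gate(feats)                 # (batch, seq, 2)
        probs = torch.softmax(logits, dim=-1)
        route = probs.argmax(dim=-1)              # 0 -> ET5, 1 -> EMamba
        return route, probs


if __name__ == "__main__":
    router = TokenRouter(d_model=512)
    emb = torch.randn(2, 16, 512)
    dom = torch.randn(2, 16, 4)
    route, probs = router(emb, seq_len=16, max_len=512, domain_feats=dom)
```

Under this reading, the hard argmax decides which expert processes each token at inference time, while the soft probabilities remain differentiable for training the router jointly with the experts.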
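One plausible form of the utility-guided multi-objective loss is sketched below, assuming it augments the task loss with a load-balancing term over expert utilization and an expected per-token compute-cost penalty. The coefficients, the relative cost constants, and the Switch-Transformer-style balance term are illustrative assumptions.

```python
import torch


def mambaformer_loss(task_loss, router_probs,
                     cost_et5=1.0, cost_emamba=0.2,
                     lambda_balance=0.01, lambda_cost=0.001):
    # router_probs: (batch, seq, 2) soft routing distribution per token.
    # Load balancing: keep the mean routing distribution near uniform,
    # discouraging collapse onto a single expert (assumed auxiliary term).
    mean_probs = router_probs.mean(dim=(0, 1))               # (2,)
    balance = ((mean_probs - 0.5) ** 2).sum()
    # Compute cost: expected per-token cost proxy, with the attention
    # expert assumed more expensive than the linear-time expert.
    costs = torch.tensor([cost_et5, cost_emamba],
                         device=router_probs.device)
    expected_cost = (router_probs * costs).sum(-1).mean()
    return task_loss + lambda_balance * balance + lambda_cost * expected_cost
```

Tuning lambda_cost upward in such a formulation would push more tokens toward the cheaper EMamba expert, which is one way a router could trade accuracy for the latency gains the abstract reports.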