Recently, mixture of experts (MoE) has become a popular paradigm for achieving a trade-off between modal capacity and efficiency in multi-modal large language models (MLLMs). Different from previous efforts, we are dedicated to exploring the dynamic expert path in an already existing MLLM and show that a standard MLLM can also be a mixture of experts. To approach this target, we propose a novel dynamic expert scheme for MLLMs, termed Routing Experts (RoE), which achieves example-dependent optimal path routing without obvious structural tweaks. Meanwhile, a new regularization of structure sparsity is also introduced to push MLLMs toward learning more short-cut inference, ensuring efficiency. In addition, we make the first attempt to align the training and inference schemes of MLLMs in terms of network routing. To validate RoE, we apply it to a set of recent MLLMs, including LLaVA-1.5, LLaVA-HR and VILA, and conduct extensive experiments on a range of VL benchmarks. The experimental results not only show the great advantages of our RoE in improving the efficiency of MLLMs, but also demonstrate clear advantages over MoE-LLaVA in both performance and speed, e.g., an average performance gain of 3.3% on 5 benchmarks while being faster.
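To make the idea of example-dependent path routing with a structure-sparsity regularizer concrete, the following is a minimal, illustrative sketch in PyTorch. It assumes a generic Transformer layer stack; the names (`LayerRouter`, `RoEBlock`, `sparsity_loss`) and the soft-gating formulation are hypothetical simplifications, not the paper's actual implementation.

```python
# Minimal sketch: per-example routing over existing Transformer layers,
# with a sparsity penalty that encourages taking the skip (short-cut) path.
import torch
import torch.nn as nn


class LayerRouter(nn.Module):
    """Predicts, per example, whether to run a layer or take the skip path."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.gate = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pool over the sequence, then squash to a (batch, 1, 1) gate in [0, 1].
        pooled = x.mean(dim=1)
        return torch.sigmoid(self.gate(pooled)).unsqueeze(-1)


class RoEBlock(nn.Module):
    """Wraps an existing layer with a router-controlled skip connection."""

    def __init__(self, layer: nn.Module, hidden_size: int):
        super().__init__()
        self.layer = layer
        self.router = LayerRouter(hidden_size)

    def forward(self, x: torch.Tensor):
        g = self.router(x)                    # gate close to 1 = use the layer
        out = g * self.layer(x) + (1.0 - g) * x
        return out, g


def sparsity_loss(gates, weight: float = 0.1) -> torch.Tensor:
    # Structure-sparsity regularizer: smaller gates mean more skipped layers.
    return weight * torch.stack([g.mean() for g in gates]).mean()
```

In this simplified view, the gates would be added to the task loss during training and thresholded (or kept soft) at inference, which is one way the training and inference routing schemes could be kept consistent.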