Recently, mixture of experts (MoE) has become a popular paradigm for balancing model capacity and efficiency in multi-modal large language models (MLLMs). Different from previous efforts, we are dedicated to exploring the dynamic expert path in an existing MLLM and show that a standard MLLM can also be a mixture of experts. To approach this target, we propose a novel dynamic expert scheme for MLLMs, termed Routing Experts (RoE), which achieves example-dependent optimal path routing without obvious structural modifications. Meanwhile, a new regularization of structural sparsity is introduced to encourage MLLMs to learn shorter inference paths, ensuring efficiency. In addition, we make the first attempt to align the training and inference schemes of MLLMs in terms of network routing. To validate RoE, we apply it to a set of recent MLLMs, including LLaVA-1.5, LLaVA-HR and VILA, and conduct extensive experiments on a set of VL benchmarks. The experimental results not only demonstrate the great advantages of RoE in improving MLLMs' efficiency, but also show obvious advantages over MoE-LLaVA in both performance and speed, e.g., an average performance gain of 3.3% on 5 benchmarks while being faster.
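The core idea of example-dependent path routing with a sparsity regularizer can be sketched in minimal form. This is not the authors' implementation; the layer, router, and penalty below (`route_example`, `LAMBDA_SPARSE`, the threshold rule) are illustrative assumptions standing in for learned components:

```python
def layer(x, scale):
    """Toy 'expert' layer: a simple transform standing in for a transformer block."""
    return [v * scale for v in x]

def router_score(x):
    """Toy per-example router: scores whether a layer is worth running.
    A trained router would be a small learned network; here we use the mean activation."""
    return sum(x) / len(x)

def route_example(x, scales, threshold=0.5):
    """Run each layer only if its router score passes the threshold;
    otherwise take the shortcut (identity), giving an example-dependent path."""
    used = 0
    for s in scales:
        if router_score(x) > threshold:
            x = layer(x, s)
            used += 1
    return x, used

LAMBDA_SPARSE = 0.1  # illustrative weight for the sparsity regularizer

def sparsity_penalty(used, total):
    """Regularizer encouraging short inference paths (few activated layers),
    added to the task loss during training in schemes of this kind."""
    return LAMBDA_SPARSE * used / total
```

In this sketch, an "easy" example whose activations fall below the threshold early skips the remaining layers, while the penalty term pushes training toward routes that activate fewer layers overall.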