Mixture-of-Experts (MoE) Multimodal large language models (MLLMs) excel at vision-language tasks, but their inference is computationally expensive. To reduce inference overhead, expert skipping methods have been proposed to deactivate redundant experts based on the current input tokens. However, we find that applying these methods, originally designed for unimodal large language models (LLMs), to MLLMs results in considerable performance degradation. This is primarily because such methods fail to account for the heterogeneous contributions of experts across MoE layers and the modality-specific behaviors of tokens within these layers. Motivated by these findings, we propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. It incorporates a globally-modulated local gating (GMLG) mechanism that integrates global layer-wise importance into local routing probabilities to accurately estimate per-token expert importance. A dual-modality thresholding (DMT) method is then applied, which processes tokens from each modality separately, to derive the skipping schedule. To set the optimal thresholds, we introduce a frontier search algorithm that exploits monotonicity properties, cutting convergence time from several days to a few hours. Extensive experiments on 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches. For instance, when skipping 88% of experts for Qwen3-VL-MoE-30B-A3B-Instruct, the performance boost is up to 10.67% (97.33% vs. 86.66%). Furthermore, MoDES significantly enhances inference speed, accelerating prefilling by 2.16$\times$ and decoding by 1.26$\times$. Our code is available at https://github.com/ModelTC/MoDES.
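The two core mechanisms can be sketched as follows. This is a minimal illustration under assumed interfaces, not the paper's implementation: the function names, the per-layer importance weights, and the threshold values are all hypothetical, and how MoDES actually derives global importance and the thresholds is described in the paper itself.

```python
import numpy as np

def gmlg_scores(router_logits, layer_importance):
    """Globally-modulated local gating (sketch): scale each token's local
    routing probabilities by assumed global per-expert importance weights
    for this layer, yielding a per-token expert-importance estimate."""
    probs = np.exp(router_logits - router_logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)          # local routing probabilities
    return probs * layer_importance                # globally modulated scores

def dual_modality_skip(scores, is_vision, tau_vision, tau_text):
    """Dual-modality thresholding (sketch): vision and text tokens are
    compared against separate thresholds; experts whose score falls below
    the token's threshold are marked for skipping."""
    tau = np.where(is_vision[:, None], tau_vision, tau_text)
    return scores < tau                            # True => skip expert for token

# Toy usage: 4 tokens (2 vision, 2 text) routed over 8 experts.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))
importance = rng.uniform(0.5, 1.0, size=8)         # assumed layer-wise weights
scores = gmlg_scores(logits, importance)
skip = dual_modality_skip(scores, np.array([True, True, False, False]),
                          tau_vision=0.1, tau_text=0.15)
```

The separate thresholds reflect the abstract's observation that tokens behave differently per modality, so a single global threshold would over- or under-skip one of the two.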