Diffusion large language models (dLLMs) have recently gained significant attention due to their inherent support for parallel decoding. Building on this paradigm, Mixture-of-Experts (MoE) dLLMs with autoregressive (AR) initialization have further demonstrated performance competitive with mainstream AR models. However, we identify a fundamental mismatch between MoE architectures and diffusion-based decoding. Specifically, a large number of experts are activated at each denoising step, while only a small subset of tokens is ultimately accepted, resulting in substantial inference overhead and limiting deployment in latency-sensitive applications. In this work, we propose TEAM, a plug-and-play framework that accelerates MoE dLLMs by enabling more accepted tokens with fewer activated experts. TEAM is motivated by the observation that expert routing decisions exhibit strong temporal consistency across denoising levels as well as spatial consistency across token positions. Leveraging these properties, TEAM employs three complementary expert activation and decoding strategies, conservatively selecting the necessary experts for decoded and masked tokens while simultaneously performing aggressive speculative exploration across multiple candidates. Experimental results demonstrate that TEAM achieves up to 2.2x speedup over the vanilla MoE dLLM, with negligible performance degradation. Code is released at https://github.com/PKU-SEC-Lab/TEAM-MoE-dLLM.
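The temporal-consistency observation above can be illustrated with a minimal routing sketch. The function names, the overlap threshold, and the reuse criterion below are illustrative assumptions, not TEAM's actual algorithm: the idea shown is simply that if a token's fresh top-k expert selection largely overlaps with the selection cached from the previous denoising step, the cached experts can be reused instead of activating a new set.

```python
import numpy as np

def topk_experts(scores, k):
    """Indices of the k highest-scoring experts for each token.

    scores: (num_tokens, num_experts) router logits.
    """
    return np.argsort(scores, axis=-1)[:, -k:]

def route_with_temporal_reuse(scores, prev_experts, k, overlap_thresh=0.5):
    """Hypothetical temporal-reuse routing (illustrative, not TEAM itself).

    For each token, compute a fresh top-k selection; if it overlaps with
    the previous denoising step's cached selection by at least
    `overlap_thresh`, keep the cached experts (avoiding new activations).
    Returns the chosen expert indices and a boolean reuse mask.
    """
    fresh = topk_experts(scores, k)
    out = fresh.copy()
    reused = np.zeros(scores.shape[0], dtype=bool)
    if prev_experts is not None:
        for t in range(scores.shape[0]):
            overlap = len(set(fresh[t]) & set(prev_experts[t])) / k
            if overlap >= overlap_thresh:
                out[t] = prev_experts[t]
                reused[t] = True
    return out, reused

# Two consecutive denoising steps with similar router scores:
# most tokens keep the experts cached from the previous step.
rng = np.random.default_rng(0)
step1_scores = rng.normal(size=(4, 8))
step2_scores = step1_scores + 0.01 * rng.normal(size=(4, 8))  # small drift
experts1, _ = route_with_temporal_reuse(step1_scores, None, k=2)
experts2, reused = route_with_temporal_reuse(step2_scores, experts1, k=2)
```

In a real MoE layer the reuse mask would let the runtime skip loading or dispatching to experts that were already active in the previous step, which is where the activation savings come from.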