Mixture-of-Experts (MoE) architectures employ sparse activation to deliver faster training and inference with higher accuracy than dense LLMs. However, in production serving, MoE models require batch inference to optimize hardware efficiency, which can cause excessive expert activation and thus slow the memory-bound decoding stage. To address this fundamental tension between batch decoding and expert sparsity, we present SERE, a Similarity-based Expert Re-routing method for Efficient batch decoding in MoE models. SERE dynamically reduces the number of active experts in an input-aware manner by re-routing tokens from secondary experts to their most similar primary counterparts. It also leverages similarity patterns to identify and preserve critical experts, thereby preventing capability loss. Notably, SERE avoids static expert pruning or merging, instead enabling dynamic expert skipping based on batch-level expert redundancy. Additionally, we provide an efficient custom CUDA kernel for SERE, enabling plug-and-play use in vLLM with only a single-line code change. Extensive experiments on various complex reasoning benchmarks demonstrate that SERE achieves up to 2.0x speedup with minimal quality loss, providing a practical solution for cost-efficient and latency-sensitive large-scale MoE deployment. The code for SERE is available at https://github.com/JL-Cheng/SERE.
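To illustrate the core idea of similarity-based re-routing, the following is a minimal NumPy sketch, not the paper's actual implementation: it assumes a pairwise expert-similarity matrix is available (how SERE computes similarity and identifies critical experts is not specified here), and it redirects each token's lower-weight "secondary" experts to the most similar of its higher-weight "primary" experts, which shrinks the set of experts activated by the batch. All names (`reroute`, `keep`, `expert_sim`) are hypothetical.

```python
import numpy as np

def reroute(topk_experts, topk_probs, expert_sim, keep=1):
    """Re-route secondary experts to similar primary experts.

    topk_experts: (T, K) int array of expert ids chosen per token.
    topk_probs:   (T, K) routing weights for those experts.
    expert_sim:   (E, E) pairwise expert similarity (assumed given).
    keep:         number of highest-weight "primary" experts kept per token.
    Returns a (T, K) array where secondary slots point at primary experts.
    """
    T, K = topk_experts.shape
    out = topk_experts.copy()
    for t in range(T):
        order = np.argsort(-topk_probs[t])          # experts by routing weight
        primary = topk_experts[t, order[:keep]]     # kept as-is
        for j in order[keep:]:
            sec = topk_experts[t, j]
            # redirect the token to the most similar primary expert
            out[t, j] = primary[np.argmax(expert_sim[sec, primary])]
    return out

# Toy example: 2 tokens, top-2 routing over 4 experts.
experts = np.array([[0, 2], [1, 3]])
probs = np.array([[0.7, 0.3], [0.6, 0.4]])
sim = np.eye(4)
sim[2, 0] = sim[0, 2] = 0.9   # expert 2 is redundant with expert 0
sim[3, 1] = sim[1, 3] = 0.8   # expert 3 is redundant with expert 1

routed = reroute(experts, probs, sim, keep=1)
# Batch-level effect: 4 unique experts before, 2 after re-routing.
print(len(np.unique(experts)), len(np.unique(routed)))
```

Fewer unique experts per batch means fewer expert weight matrices streamed from memory during decoding, which is where the speedup in the memory-bound stage would come from.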