Most recent state-of-the-art (SOTA) large language models (LLMs) use Mixture-of-Experts (MoE) architectures to scale model capacity without a proportional increase in per-token compute, enabling higher-quality outputs at manageable serving costs. However, MoE inference at scale is fundamentally bottlenecked by expert load imbalance and inefficient token routing, especially in multi-node deployments, where tokens are not guaranteed to be routed to local experts and therefore incur significant inter-node all-to-all communication overhead. To systematically characterize these challenges, we profile SOTA open-source MoE models, including Llama 4 Maverick, DeepSeek V3-671B, and Qwen3-235B-A22B, on a range of datasets and collect over 100k real expert activation traces. Studying these traces, we uncover several properties that persist across all the frontier MoE models we profile: variable expert load imbalance; domain-specific expert activation, where expert popularity shifts across task families (code, math, chat, general); and a strong correlation between prefill and decode expert activations. Motivated by these findings, we propose workload-aware micro-batch grouping and an expert placement strategy that maximize token locality to the destination expert, thereby reducing inter-node communication. Across models and datasets, these optimizations reduce all-to-all communication volume by up to 20×, resulting in lower MoE decode latency and better accelerator utilization.
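To make the token-locality argument concrete, the following is a minimal sketch, not the method proposed in this work. It assumes that micro-batches have already been grouped by domain onto nodes, so each node's tokens favor a different subset of experts, and it compares a round-robin expert placement against a simple greedy heuristic that co-locates each expert with the node sending it the most tokens. The helper names (`remote_fraction`, `round_robin_placement`, `greedy_locality_placement`) and the toy trace are hypothetical and chosen only for illustration.

```python
from collections import Counter


def remote_fraction(dispatches, placement):
    """Fraction of (source_node, expert) dispatches that cross node boundaries."""
    remote = sum(1 for node, expert in dispatches if placement[expert] != node)
    return remote / len(dispatches)


def round_robin_placement(num_experts, num_nodes):
    """Baseline: expert e lives on node e % num_nodes, ignoring the workload."""
    return {e: e % num_nodes for e in range(num_experts)}


def greedy_locality_placement(dispatches, num_experts, num_nodes):
    """Illustrative heuristic (an assumption, not the paper's algorithm):
    visit (node, expert) pairs from heaviest to lightest traffic and place
    each expert on the first node that still has a free expert slot."""
    capacity = num_experts // num_nodes  # assumes num_experts % num_nodes == 0
    traffic = Counter(dispatches)        # (node, expert) -> token count
    placement = {}
    load = [0] * num_nodes
    for (node, expert), _ in traffic.most_common():
        if expert not in placement and load[node] < capacity:
            placement[expert] = node
            load[node] += 1
    # Experts that saw no traffic, or whose preferred nodes were full,
    # go to the least-loaded node.
    for expert in range(num_experts):
        if expert not in placement:
            node = min(range(num_nodes), key=lambda n: load[n])
            placement[expert] = node
            load[node] += 1
    return placement


# Toy trace: node 0 serves a "code" micro-batch group that mostly hits
# experts 1 and 3; node 1 serves a "chat" group that mostly hits experts 0 and 2.
dispatches = (
    [(0, 1)] * 40 + [(0, 3)] * 30 + [(0, 0)] * 10
    + [(1, 0)] * 35 + [(1, 2)] * 35 + [(1, 1)] * 10
)

baseline = round_robin_placement(num_experts=4, num_nodes=2)
aware = greedy_locality_placement(dispatches, num_experts=4, num_nodes=2)

print(remote_fraction(dispatches, baseline))  # 0.875
print(remote_fraction(dispatches, aware))     # 0.125
```

On this toy trace the share of dispatches that must cross nodes falls from 87.5% to 12.5%. The numbers come from the hand-made trace and say nothing about the measured results; the sketch only shows why domain-skewed expert popularity makes workload-aware grouping and placement pay off.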