Large-scale Mixture-of-Experts (MoE) Large Language Models (LLMs) have recently become the frontier of open-weight models, achieving capabilities comparable to proprietary ones. However, their random expert selection mechanism introduces significant data movement overhead, which becomes the dominant bottleneck in multi-unit LLM serving systems. To understand the patterns underlying this data movement, we conduct comprehensive data-movement-centric profiling across four state-of-the-art large-scale MoE models released in 2025 (200B-1000B parameters), using over 24,000 requests spanning diverse workloads. We perform systematic analysis from both temporal and spatial perspectives and distill six key insights to guide the design of diverse future serving systems. Applying these insights, we then present a case study on improving wafer-scale GPUs and show that minor architectural modifications informed by the insights achieve substantial performance gains, delivering 5.3x and 3.1x average speedups on DeepSeek V3 and Qwen3, respectively. Our work presents the first comprehensive data-centric analysis of large-scale MoE models, together with a concrete design study based on the learned lessons; our profiling traces and simulation framework are already open-sourced with $>$1k downloads. Our traces and results are publicly available at https://huggingface.co/datasets/core12345/MoE_expert_selection_trace
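For readers who wish to inspect the released traces, the following is a minimal sketch (not part of the paper's artifacts) of fetching the dataset repository linked above with the Hugging Face \texttt{huggingface\_hub} Python package; the file names and trace format inside the repository are not described in this abstract and should be checked after download.

\begin{verbatim}
# Minimal sketch (assumption: the traces are hosted as a standard
# Hugging Face dataset repository). Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

# Download the full dataset repository referenced in the abstract.
local_dir = snapshot_download(
    repo_id="core12345/MoE_expert_selection_trace",
    repo_type="dataset",  # dataset repo, not a model repo
)
print("Expert-selection traces downloaded to:", local_dir)
\end{verbatim}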