Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference

Large-scale Mixture-of-Experts (MoE) Large Language Models (LLMs) have recently become the frontier open-weight models, with capabilities comparable to those of proprietary models. However, their random expert selection mechanism introduces significant data movement overhead, which becomes the dominant bottleneck in multi-unit LLM serving systems. To understand the patterns underlying this data movement, we conduct comprehensive data-movement-centric profiling of four state-of-the-art large-scale MoE models released in 2025 (200B-1000B parameters), using over 24,000 requests spanning diverse workloads. We analyze the traces systematically from both temporal and spatial perspectives and distill six key insights to guide the design of diverse serving systems. We verify these insights on both future wafer-scale GPU architectures and existing GPU systems. On wafer-scale GPUs, lightweight architectural modifications guided by our insights yield a 6.6× average speedup across the four 200B-1000B models. On existing GPU systems, our insights drive the design of a prefill-aware expert placement algorithm that achieves up to a 1.25× speedup on MoE computation. Our work presents the first comprehensive data-centric analysis of large-scale MoE models, together with a concrete design study that applies the lessons learned. Our profiling traces are publicly available at https://huggingface.co/datasets/core12345/MoE_expert_selection_trace.

