Coordinated Scheduling for MoE LLM Serving

Serving Mixture-of-Experts (MoE) large language models (LLMs) is challenging because dynamic request workloads interact with sparse expert routing, creating both data-parallel (DP) engine imbalance and expert-level hotspots. Existing LLM serving systems typically make scheduling and expert-balancing decisions in isolation: frontend schedulers route requests using coarse request counters, while backend expert balancers rely mainly on aggregate expert activation counts. This separation prevents the serving system from reacting to fine-grained engine pressure, backend MoE pressure, and source-dependent expert traffic. To address this gap, we propose Gimbal, a coordinated cross-level scheduling system for efficient MoE-based LLM serving. First, Gimbal introduces a fine-grained DP-engine scheduler that uses online backend pressure signals, including key-value (KV) cache usage, remaining prefill work, queue pressure, and MoE expert pressure, to dispatch requests away from overloaded engines. Inside each engine, Gimbal further applies a lightweight prefill-aware queue ordering policy with aging to reduce head-of-line blocking without requiring output-length prediction. Second, Gimbal extends expert load balancing with online source-DP-to-expert routing statistics and uses a heuristic guided by a mixed-integer nonlinear program (MINLP) to place experts while jointly considering expert load, source-aware communication, and migration stability. Our evaluation shows that, compared with the state-of-the-art serving system vLLM, Gimbal reduces average Time To First Token (TTFT) by 42.9% and average Time Per Output Token (TPOT) by 33.3%, while improving request throughput under high load by 3.0%.
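To make the two frontend mechanisms concrete, the following is a minimal Python sketch, not Gimbal's actual implementation. It assumes that each engine reports four pressure signals normalized to [0, 1], that the dispatcher combines them with a weighted sum, and that the in-engine queue is ordered by prefill length minus a linear aging credit. The weights, the `aging_rate` value, and all names (`EngineState`, `pick_engine`, `order_queue`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class EngineState:
    """Assumed per-engine pressure signals, each normalized to [0, 1]."""
    kv_cache_usage: float
    remaining_prefill: float
    queue_pressure: float
    expert_pressure: float


# Hypothetical weights; the abstract does not specify how signals are combined.
WEIGHTS = {
    "kv_cache_usage": 0.4,
    "remaining_prefill": 0.3,
    "queue_pressure": 0.2,
    "expert_pressure": 0.1,
}


def pressure(engine: EngineState) -> float:
    """Combine the signals into one score (assumed weighted sum)."""
    return sum(w * getattr(engine, name) for name, w in WEIGHTS.items())


def pick_engine(engines: list[EngineState]) -> int:
    """Dispatch to the DP engine with the lowest combined pressure."""
    return min(range(len(engines)), key=lambda i: pressure(engines[i]))


@dataclass
class Request:
    req_id: str
    prefill_tokens: int
    arrival: float  # arrival time in seconds


def order_queue(queue: list[Request], now: float,
                aging_rate: float = 100.0) -> list[Request]:
    """Prefill-aware ordering with aging (assumed linear aging credit).

    Short prefills go first, but every second of waiting subtracts
    `aging_rate` tokens from a request's effective cost, so a long
    request eventually outranks newly arrived short ones.
    """
    def effective_cost(r: Request) -> float:
        return r.prefill_tokens - aging_rate * (now - r.arrival)

    return sorted(queue, key=effective_cost)


if __name__ == "__main__":
    engines = [
        EngineState(0.9, 0.5, 0.5, 0.5),  # heavily loaded engine
        EngineState(0.2, 0.1, 0.3, 0.4),  # lightly loaded engine
    ]
    print("dispatch to engine", pick_engine(engines))

    long_req = Request("long", prefill_tokens=4000, arrival=0.0)
    short_req = Request("short", prefill_tokens=500, arrival=9.0)
    print([r.req_id for r in order_queue([long_req, short_req], now=10.0)])
```

Because the ordering key uses only prefill length and waiting time, both of which are known at admission, this sketch needs no output-length prediction, which matches the property the abstract claims for the queue policy.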
