Serving Mixture-of-Experts (MoE) large language models (LLMs) is challenging because dynamic request workloads interact with sparse expert routing, creating both data-parallel (DP) engine imbalance and expert-level hotspots. Existing LLM serving systems typically handle these two problems in isolation: frontend schedulers route requests using coarse request counters, while backend expert balancers rely mainly on aggregate expert activation counts. This separation prevents the serving system from reacting to fine-grained engine pressure, backend MoE pressure, and source-dependent expert traffic. To address this gap, we propose Gimbal, a coordinated cross-level scheduling system for efficient MoE-based LLM serving. First, Gimbal introduces a fine-grained DP-engine scheduler that uses online backend pressure signals, including key-value (KV) cache usage, remaining prefill work, queue pressure, and MoE expert pressure, to dispatch requests away from overloaded engines. Inside each engine, Gimbal further applies a lightweight prefill-aware queue ordering policy with aging, which reduces head-of-line blocking without requiring output-length prediction. Second, Gimbal extends expert load balancing with online source-DP-to-expert routing statistics and uses a heuristic guided by a mixed-integer nonlinear program (MINLP) to place experts while jointly considering expert load, source-aware communication, and migration stability. Our evaluation shows that, compared with the state-of-the-art serving system vLLM, Gimbal reduces average Time To First Token (TTFT) by 42.9% and average Time Per Output Token (TPOT) by 33.3%, while improving high-load request throughput by 3.0%.
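To make the two frontend mechanisms concrete, the following is a minimal illustrative sketch, not Gimbal's actual implementation. It assumes that each engine reports the four pressure signals named above, that they are combined by a weighted sum, and that queue aging is linear in waiting time. All names (`EngineStats`, `pressure`, `pick_engine`, `order_queue`), the weights, and the normalization caps are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class EngineStats:
    """Hypothetical per-engine snapshot of the four pressure signals."""
    kv_cache_usage: float        # fraction of KV cache in use, in [0, 1]
    pending_prefill_tokens: int  # prefill tokens not yet processed
    queued_requests: int         # requests waiting in the engine queue
    expert_pressure: float       # normalized MoE expert pressure, in [0, 1]


def pressure(stats, weights=(0.4, 0.3, 0.2, 0.1),
             prefill_cap=8192, queue_cap=64):
    """Combine the signals into one score (assumed weighted sum).

    Weights and caps are illustrative, not values from the paper.
    """
    w_kv, w_prefill, w_queue, w_moe = weights
    return (w_kv * stats.kv_cache_usage
            + w_prefill * min(stats.pending_prefill_tokens / prefill_cap, 1.0)
            + w_queue * min(stats.queued_requests / queue_cap, 1.0)
            + w_moe * stats.expert_pressure)


def pick_engine(engines):
    """Dispatch to the engine with the lowest combined pressure."""
    return min(engines, key=lambda name: pressure(engines[name]))


@dataclass
class QueuedRequest:
    request_id: str
    prefill_tokens: int
    arrival_time: float  # seconds


def order_queue(queue, now, aging_rate=100.0):
    """Prefill-aware ordering with linear aging (assumed form).

    Shorter prefills go first, but every second of waiting reduces a
    request's effective cost by `aging_rate` tokens, so an old long
    request eventually outranks newly arrived short ones.
    """
    return sorted(
        queue,
        key=lambda r: r.prefill_tokens - aging_rate * (now - r.arrival_time),
    )


engines = {
    "dp0": EngineStats(0.9, 4096, 10, 0.5),
    "dp1": EngineStats(0.3, 1024, 2, 0.2),
}
chosen = pick_engine(engines)

queue = [
    QueuedRequest("long-old", 4000, arrival_time=0.0),
    QueuedRequest("short-new", 200, arrival_time=50.0),
]
ordered = [r.request_id for r in order_queue(queue, now=60.0)]
```

In this sketch the ordering key depends only on prompt length and waiting time, both known at admission, which is why no output-length predictor is needed. A request counter alone would not distinguish `dp0` from `dp1` by KV cache or prefill backlog, which is the coarse-signal limitation the abstract describes.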