Multi-Layer Scheduling for MoE-Based LLM Reasoning

Large Language Models (LLMs) have achieved remarkable success across a wide range of tasks, but serving them efficiently at scale remains a critical challenge due to their substantial computational and latency demands. Most existing inference frameworks rely on simple scheduling strategies, such as First-Come-First-Serve (FCFS) at the engine level and Round-Robin (RR) at the scheduler or coordinator level. These strategies often fail to fully utilize system resources and can suffer from head-of-line blocking and load imbalance. Recent advances in Mixture-of-Experts (MoE) models have introduced further scheduling challenges arising from expert parallelism and routing complexity. This research proposes a multi-layer scheduling framework tailored for MoE-based LLM serving. It targets scheduling at three levels: request-level, engine-level, and expert-level. At the request level, we explore algorithms such as Shortest-Job-First (SJF) and priority-aware aging to improve throughput and reduce latency. At the engine level, we design load-aware dispatching strategies that account for the current prefix token load, KV cache utilization, and user stickiness to achieve better resource matching. At the expert level, we focus on alleviating expert hotspots and strategically placing experts with inter-layer dependencies to balance load and improve routing efficiency. Results from more than 100 experiments conducted under diverse workload distributions show that our approach consistently outperforms the state-of-the-art inference framework vLLM, achieving up to a 17.8% reduction in Time To First Token (TTFT) latency and a 13.3% reduction in Time-Per-Output-Token (TPOT) latency.
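
To make the request-level idea concrete, the following is a minimal sketch of Shortest-Job-First with aging, not the scheduler evaluated in this work. It assumes each request carries a predicted output length, and that waiting time is converted into a linear credit subtracted from that length. The names `Request`, `predicted_tokens`, `effective_cost`, `pick_next`, and the `aging_rate` value are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    arrival_time: float    # seconds since some reference point
    predicted_tokens: int  # hypothetical estimate of the job length


def effective_cost(req: Request, now: float, aging_rate: float) -> float:
    """Predicted length minus a credit that grows linearly with waiting time."""
    waited = max(0.0, now - req.arrival_time)
    return req.predicted_tokens - aging_rate * waited


def pick_next(queue: list[Request], now: float, aging_rate: float = 10.0) -> Request:
    """Remove and return the request with the lowest effective cost.

    Ties are broken by arrival time, so the policy degrades to FCFS
    among requests that look equally cheap.
    """
    best = min(queue, key=lambda r: (effective_cost(r, now, aging_rate), r.arrival_time))
    queue.remove(best)
    return best


def demo() -> list[str]:
    """Serve three queued requests and return the order in which they ran."""
    queue = [
        Request("long", arrival_time=0.0, predicted_tokens=500),
        Request("short-a", arrival_time=1.0, predicted_tokens=50),
        Request("short-b", arrival_time=2.0, predicted_tokens=60),
    ]
    order = []
    now = 2.0
    while queue:
        req = pick_next(queue, now)
        order.append(req.request_id)
        now += req.predicted_tokens / 100.0  # assumed service rate of 100 tokens/s
    return order
```

In this sketch, short requests run first while the long request is still recent, which is the SJF behaviour that reduces head-of-line blocking. Once the long request has waited long enough, its aging credit outweighs its length and it is scheduled ahead of newly arrived short requests, which guards against starvation. A linear credit is only one possible aging rule.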
