Meeting stringent Time-To-First-Token (TTFT) requirements is crucial for LLM applications. To improve efficiency, modern LLM serving systems adopt disaggregated architectures with diverse forms of parallelism, which introduce complex multi-stage workflows involving reusable KV-block retrieval, collective communication, and prefill-to-decode (P2D) transfer. Flows from dependent stages overlap within and across requests on shared bottleneck links, which makes TTFT highly susceptible to network contention and calls for stage-aware scheduling. Unfortunately, most existing work schedules flows in a stage-agnostic manner, and the resulting uncoordinated contention is a primary cause of SLO violations. In this paper, we present MFS, a holistic multi-stage flow scheduling mechanism designed to maximize TTFT SLO attainment. At its core, MFS approximates the Least-Laxity-First (LLF) scheduling policy without requiring precise knowledge of a request's remaining slack. It does so by applying a Defer-and-Promote principle, realized with a Reverse Multi-Level Queue (RMLQ) structure. By dynamically promoting a task's precedence as its effective laxity diminishes, MFS prioritizes flows with less laxity while preventing requests with loose SLOs from prematurely consuming network bandwidth. We implement MFS as a pluggable module integrated into vLLM and evaluate it on an 8-server, 32-GPU testbed as well as through large-scale simulations. Our results show that MFS outperforms state-of-the-art baselines, improving TTFT SLO attainment by 1.2x to 2.4x.
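To make the Defer-and-Promote idea in the abstract concrete, the following is a minimal sketch, not the MFS implementation. It assumes laxity is computed as deadline minus current time minus a coarse estimate of remaining work, and that the queue levels are defined by fixed laxity thresholds. The names `Flow`, `ReverseMLQ`, `level`, and `pick`, as well as the threshold values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    name: str
    deadline: float       # absolute TTFT deadline in seconds (assumed known)
    est_remaining: float  # coarse estimate of remaining work in seconds (assumption)


def laxity(flow: Flow, now: float) -> float:
    """Slack left before the flow would miss its deadline."""
    return flow.deadline - now - flow.est_remaining


class ReverseMLQ:
    """Hypothetical reverse multi-level queue.

    Flows start at level 0 (deferred, lowest priority) and are promoted
    to higher levels as their laxity drops below each threshold. This is
    the reverse of a classic MLFQ, where jobs are demoted over time.
    """

    def __init__(self, thresholds: list[float]):
        # Stored from loosest to tightest; each crossed threshold adds one level.
        self.thresholds = sorted(thresholds, reverse=True)

    def level(self, flow: Flow, now: float) -> int:
        lax = laxity(flow, now)
        return sum(1 for t in self.thresholds if lax <= t)

    def pick(self, flows: list[Flow], now: float) -> Flow:
        # Serve the highest level first; break ties by earlier deadline.
        return max(flows, key=lambda f: (self.level(f, now), -f.deadline))


if __name__ == "__main__":
    q = ReverseMLQ([0.5, 0.2, 0.05])
    loose = Flow("loose", deadline=1.0, est_remaining=0.1)
    tight = Flow("tight", deadline=0.4, est_remaining=0.1)
    for now in (0.0, 0.75):
        print(now, {f.name: q.level(f, now) for f in (loose, tight)})
```

Because only the threshold crossings matter, a scheduler built this way needs just a rough laxity estimate rather than an exact one, which is the property the abstract attributes to approximating LLF. A flow with a loose SLO stays deferred at level 0 until its slack shrinks, so it does not compete for bandwidth early.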