In production, large language model (LLM) serving must meet stringent service-level objectives (SLOs) under highly variable request patterns. Request lengths follow a long-tail distribution, which in disaggregated serving architectures causes head-of-line blocking on the prefill side and straggler-induced underutilization on the decode side. Current systems, which use first-come-first-served (FCFS) scheduling for prefill and continuous batching for decode, cannot adapt to this imbalance, so both SLO attainment and throughput suffer. To address these challenges, we propose Kairos, an SLO-aware scheduling system with two complementary mechanisms. On the prefill side, Kairos employs urgency-based priority scheduling: it predicts prefill completion times and dynamically selects requests to maximize attainment of time-to-first-token (TTFT) SLOs. On the decode side, Kairos introduces slack-guided adaptive batching, which uses the gap between per-step decode time and the time-per-output-token (TPOT) SLO to greedily pack short requests, maximizing throughput while strictly adhering to the SLO. We implement Kairos and evaluate it using an online serving dataset and a state-of-the-art LLM. Compared with state-of-the-art baselines, Kairos improves TTFT SLO attainment by up to 23.9%, TPOT SLO attainment by up to 27.1%, end-to-end SLO attainment by up to 33.8%, and decode throughput by up to 19.3%.
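To make the two mechanisms concrete, the following Python sketch illustrates one plausible reading of them. It is not the Kairos implementation: the linear cost models (`predict_prefill_ms`, `predict_step_ms`), their coefficients, the `Request` fields, the least-slack selection rule, and the FCFS fallback are all assumptions introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    arrival_ms: float          # arrival timestamp in milliseconds
    prompt_tokens: int         # prompt length, drives prefill cost
    remaining_output: int = 0  # estimated output tokens still to decode


def predict_prefill_ms(prompt_tokens, base=10.0, per_token=0.5):
    """Hypothetical linear prefill cost model (assumed, not from the paper)."""
    return base + per_token * prompt_tokens


def pick_prefill(queue, now_ms, ttft_slo_ms):
    """Urgency-based selection: among requests that can still meet their
    TTFT deadline, pick the one with the least slack. If none can, fall
    back to the oldest request (FCFS). Both rules are assumptions."""
    def slack(r):
        deadline = r.arrival_ms + ttft_slo_ms
        return deadline - (now_ms + predict_prefill_ms(r.prompt_tokens))

    feasible = [r for r in queue if slack(r) >= 0]
    if feasible:
        return min(feasible, key=slack)
    return min(queue, key=lambda r: r.arrival_ms)


def predict_step_ms(batch_size, base=20.0, per_request=2.0):
    """Hypothetical per-step decode latency as a function of batch size."""
    return base + per_request * batch_size


def pack_decode_batch(running, waiting, tpot_slo_ms):
    """Slack-guided packing: keep the running batch, then greedily admit
    the shortest waiting requests while the predicted per-step time stays
    within the TPOT SLO."""
    batch = list(running)
    for r in sorted(waiting, key=lambda r: r.remaining_output):
        if predict_step_ms(len(batch) + 1) > tpot_slo_ms:
            break  # no slack left: one more request would violate TPOT
        batch.append(r)
    return batch


# Example data (made up for illustration).
a = Request("a", arrival_ms=0.0, prompt_tokens=1000)
b = Request("b", arrival_ms=200.0, prompt_tokens=100)
c = Request("c", arrival_ms=100.0, prompt_tokens=4000)
chosen = pick_prefill([a, b, c], now_ms=300.0, ttft_slo_ms=1000.0)

running = [Request("r1", 0.0, 10, 400), Request("r2", 0.0, 10, 300)]
waiting = [Request(f"w{n}", 0.0, 10, n) for n in (50, 5, 200, 10, 30)]
batch = pack_decode_batch(running, waiting, tpot_slo_ms=30.0)
```

In this toy setup, request `c` can no longer meet its TTFT deadline, so the selector chooses between `a` and `b` and prefers `a`, whose deadline is closer. On the decode side, the batch grows only until the modeled step time reaches the TPOT budget, so short requests fill the slack left by the long-running ones.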