With the rapid growth of interactive applications in online large language model (LLM) services, sustaining high system throughput while meeting user-perceived latency targets has become a key problem in inference scheduling. Existing LLM serving systems rely on coarse-grained output constraints, which makes it difficult to handle resource contention among concurrent requests; the result is low resource utilization and limited support for fine-grained quality-of-service (QoS) differentiation. We present SlidingServe, a sliding-window-driven, SLO-aware scheduling system for online LLM inference. SlidingServe uses a lightweight batch latency predictor to estimate the execution time of a batch. Building on this predictor, its SlidingChunker combines information from the current and next iterations to chunk requests dynamically, improving overall system throughput while maintaining strict QoS guarantees. A Multi-Level Priority Sorter orders candidate requests to balance fairness and efficiency. In addition, when multiple requests in the same batch are at risk of violating their SLOs, the BatchConstructor uses dynamic programming to select the set of requests to execute in the current round, mitigating the SLO violation risk of critical requests. Our evaluation shows that SlidingServe improves service capacity by up to 30% over advanced scheduling systems under various load conditions, and further reduces the SLO violation rate by 16%-53% under heavy-load inference.
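To make the dynamic-programming selection step concrete, the following is a minimal sketch and not SlidingServe's actual BatchConstructor. It assumes the problem can be modeled as a 0/1 knapsack: each candidate request has a token cost and an integer urgency weight, and the batch must fit a token budget that a latency predictor would derive from the SLO. The names `Request`, `tokens`, `urgency`, `select_batch`, and `token_budget` are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Request:
    req_id: str
    tokens: int   # hypothetical cost: tokens this request adds to the batch
    urgency: int  # hypothetical weight: higher means closer to missing its SLO


def select_batch(requests: List[Request], token_budget: int) -> List[Request]:
    """0/1 knapsack: maximize total urgency with total tokens <= token_budget."""
    n = len(requests)
    # best[i][b] = max total urgency using the first i requests within budget b
    best = [[0] * (token_budget + 1) for _ in range(n + 1)]
    for i, req in enumerate(requests, start=1):
        for b in range(token_budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip this request
            if req.tokens <= b:          # option 2: take it if it fits
                cand = best[i - 1][b - req.tokens] + req.urgency
                if cand > best[i][b]:
                    best[i][b] = cand

    # Walk the table backwards to recover which requests were taken.
    chosen: List[Request] = []
    b = token_budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(requests[i - 1])
            b -= requests[i - 1].tokens
    chosen.reverse()
    return chosen


if __name__ == "__main__":
    candidates = [
        Request("A", tokens=300, urgency=4),
        Request("B", tokens=200, urgency=3),
        Request("C", tokens=200, urgency=3),
        Request("D", tokens=100, urgency=1),
    ]
    print([r.req_id for r in select_batch(candidates, token_budget=400)])
```

In this toy instance, a greedy pick of the single most urgent request (A) leaves room only for D, for a total urgency of 5, whereas the table-based search finds B and C with a total of 6. That gap is the kind of case where an exact selection can help when several requests are at risk at once. A real scheduler would replace the token budget with a predicted batch latency bound.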