AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving

As augmented large language models (LLMs) that call external tools become increasingly common in web applications, improving the efficiency of augmented LLM inference serving and meeting service-level objectives (SLOs) are critical to user experience. To this end, an inference system must maximize the number of requests it completes within their latency constraints, that is, increase its effective throughput. However, existing systems face two major challenges: (i) reliance on first-come-first-served (FCFS) scheduling causes severe head-of-line blocking, so queueing delays exceed the SLOs for many requests; and (ii) a static batch token limit fails to adapt to fluctuating loads and hardware conditions. Both factors degrade effective throughput and service quality. This paper presents AugServe, an efficient inference framework designed to reduce queueing latency and improve effective throughput for augmented LLM inference services. The core idea of AugServe is a two-stage adaptive request scheduling strategy. In stage I, AugServe uses the inference features of augmented LLM requests to optimize the scheduling order. In stage II, it continuously refines these decisions with runtime information, adapting to both request characteristics and system capabilities. In addition, AugServe dynamically adjusts the token batching mechanism based on hardware status and real-time load, further improving throughput. Experimental results show that AugServe achieves 4.7x and 3.3x higher effective throughput than vLLM and InferCept, respectively, while reducing time-to-first-token (TTFT) by up to 96.3% and 95.0%.

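To make head-of-line blocking and effective throughput concrete, the following is a minimal toy sketch, not AugServe's scheduler. It assumes a single non-preemptive server, service times that are known in advance, and an SLO expressed as a bound on queueing delay. The names `Request`, `simulate`, `fcfs`, `shortest_first`, and `effective_throughput` are hypothetical, and shortest-first ordering is used only as a simple stand-in for a reordering policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    arrival: float  # arrival time in seconds
    service: float  # server time needed in seconds (assumed known here)


def simulate(requests, pick):
    """Run a single-server, non-preemptive queue.

    `pick` chooses the next request among those that have arrived.
    Returns a dict mapping request name to queueing delay in seconds.
    """
    pending = sorted(requests, key=lambda r: r.arrival)  # stable sort
    clock = 0.0
    delays = {}
    while pending:
        ready = [r for r in pending if r.arrival <= clock]
        if not ready:
            # Server is idle: jump to the next arrival.
            clock = pending[0].arrival
            continue
        nxt = pick(ready)
        delays[nxt.name] = clock - nxt.arrival
        clock += nxt.service
        pending.remove(nxt)
    return delays


def fcfs(ready):
    """First-come-first-served: earliest arrival wins (ties keep list order)."""
    return min(ready, key=lambda r: r.arrival)


def shortest_first(ready):
    """Illustrative reordering policy: shortest service time wins."""
    return min(ready, key=lambda r: r.service)


def effective_throughput(delays, slo, horizon):
    """Requests per second whose queueing delay stays within the SLO."""
    return sum(d <= slo for d in delays.values()) / horizon


# One long request queued ahead of three short ones, all arriving at t = 0.
DEMO = [
    Request("A", 0.0, 10.0),
    Request("B", 0.0, 1.0),
    Request("C", 0.0, 1.0),
    Request("D", 0.0, 1.0),
]

if __name__ == "__main__":
    horizon = sum(r.service for r in DEMO)  # 13 s of total work
    for policy in (fcfs, shortest_first):
        delays = simulate(DEMO, policy)
        print(policy.__name__, delays, effective_throughput(delays, 5.0, horizon))
```

In this toy setting with a 5-second queueing SLO, FCFS lets only the long request A meet the SLO because B, C, and D wait behind it, whereas shortest-first ordering lets all four requests meet it over the same 13 seconds of work. A real system cannot assume exact service times, which is one reason runtime refinement of the schedule is useful.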