AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving

As augmented large language models (LLMs) with external tools become increasingly popular in web applications, improving the efficiency of augmented LLM inference serving and meeting service-level objectives (SLOs) are critical for user experience. To achieve this, inference systems must maximize the number of requests completed within their latency constraints, that is, increase effective throughput. However, existing systems face two major challenges: (i) reliance on first-come-first-served (FCFS) scheduling causes severe head-of-line blocking, so that queueing delays exceed the SLOs for many requests; and (ii) a static batch token limit fails to adapt to fluctuating loads and hardware conditions. Both factors degrade effective throughput and service quality. This paper presents AugServe, an efficient inference framework designed to reduce queueing latency and enhance effective throughput for augmented LLM inference services. The core idea of AugServe is a two-stage adaptive request scheduling strategy. Specifically, AugServe uses the inference features of augmented LLM requests to optimize the scheduling order (stage I). These decisions are then continuously refined with runtime information (stage II), adapting to both request characteristics and system capabilities. In addition, AugServe dynamically adjusts the batch token limit based on hardware status and real-time load, further enhancing throughput. Experimental results show that AugServe achieves 4.7-33.1x and 3.3-13.2x higher effective throughput than vLLM and InferCept, respectively, while reducing time-to-first-token (TTFT) by up to 96.3% and 95.0%, respectively.
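To make the notions of head-of-line blocking and effective throughput concrete, the following minimal sketch simulates a single non-preemptive server and counts the requests whose queueing delay stays within an SLO. It is an illustration only and not AugServe's scheduler: the `Request` fields, the `effective_throughput` helper, the service times, the SLO value, and the shortest-service-first ordering are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: float  # arrival time in seconds (hypothetical)
    service: float  # total service time in seconds (hypothetical)


def effective_throughput(requests, slo, key):
    """Requests served within the queueing-delay SLO, per second of makespan.

    One request is served at a time without preemption. Among the requests
    that have already arrived, the one minimizing `key` is served next.
    """
    pending = sorted(requests, key=lambda r: r.arrival)
    clock = 0.0
    met = 0
    while pending:
        ready = [r for r in pending if r.arrival <= clock]
        if not ready:
            # Server is idle: jump to the next arrival.
            clock = min(r.arrival for r in pending)
            continue
        nxt = min(ready, key=key)
        pending.remove(nxt)
        if clock - nxt.arrival <= slo:  # queueing delay within the SLO
            met += 1
        clock += nxt.service
    return met / clock


# One long request arrives just ahead of three short ones.
workload = [Request(0.0, 10.0), Request(0.0, 1.0),
            Request(0.0, 1.0), Request(0.0, 1.0)]

fcfs = effective_throughput(workload, slo=3.0, key=lambda r: r.arrival)
shortest_first = effective_throughput(workload, slo=3.0, key=lambda r: r.service)
print(f"FCFS: {fcfs:.3f} req/s, shortest-first: {shortest_first:.3f} req/s")
```

Under FCFS the long request blocks the three short ones, so only one of the four requests meets the SLO. Reordering by service time lets all four meet it over the same 13-second makespan. A real augmented LLM workload additionally involves batching, tool-call pauses, and service times that are unknown in advance, none of which this sketch models.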
