High-throughput inference serving is essential for applications built on large language models (LLMs). Existing serving frameworks reduce request-level and batch-level bubbles through batching and scheduling, but often overlook bubbles within each decode iteration. Tokens generated in the same iteration may incur different costs because they depend on KV caches of different lengths; tokens with long KV caches can become bottlenecks and delay the next iteration. We propose AlignedServe, an LLM serving framework built around prefix-aware batching. It groups requests with similar KV-cache lengths into the same batch to reduce iteration-level bubbles. To support this policy efficiently, AlignedServe uses large CPU memory to maintain sufficient in-flight requests for batching and applies a batch-level scheduling policy to reduce batch-level bubbles. It also introduces a GPU-Prefetch-For-GPU architecture, where one GPU prefetches KV cache for another to reduce CPU-to-GPU transfer latency. Experiments on synthetic and application workloads show that AlignedServe improves decoding throughput by up to 1.98 times and reduces latency by up to 7.4 times over state-of-the-art systems.
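The batching idea described above, grouping requests with similar KV-cache lengths, can be illustrated with a minimal sketch. This is not AlignedServe's implementation: the function names `batch_by_kv_length` and `iteration_bubble`, the sort-then-chunk grouping, and the bubble proxy (each request waits for the longest KV cache in its batch) are all assumptions made for illustration only.

```python
from typing import List


def batch_by_kv_length(kv_lengths: List[int], batch_size: int) -> List[List[int]]:
    """Group request indices so that each batch holds similar KV-cache lengths.

    Hypothetical helper: sorts requests by KV-cache length, then cuts the
    sorted order into consecutive chunks of at most `batch_size` requests.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    order = sorted(range(len(kv_lengths)), key=lambda i: kv_lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def iteration_bubble(kv_lengths: List[int], batch: List[int]) -> int:
    """Assumed proxy for the iteration-level bubble of one batch.

    Every request is modeled as idling until the request with the longest
    KV cache finishes, so the bubble is the summed gap to the batch maximum.
    """
    longest = max(kv_lengths[i] for i in batch)
    return sum(longest - kv_lengths[i] for i in batch)


# Six in-flight requests with a mix of short and long KV caches (made-up values).
kv_lengths = [4000, 120, 3900, 100, 4100, 90]

# Arrival-order batching mixes short and long requests in each batch.
fifo_batches = [[0, 1, 2], [3, 4, 5]]
# Length-aware batching keeps short with short and long with long.
aligned_batches = batch_by_kv_length(kv_lengths, batch_size=3)

fifo_bubble = sum(iteration_bubble(kv_lengths, b) for b in fifo_batches)
aligned_bubble = sum(iteration_bubble(kv_lengths, b) for b in aligned_batches)
print(aligned_batches, fifo_bubble, aligned_bubble)
```

Under this toy cost model, the length-aware grouping leaves far less idle work per iteration than arrival-order batching. A real scheduler would also have to weigh arrival times, memory limits, and KV-cache transfer cost, none of which are modeled here.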