In this paper, we consider a mixed-prompt scenario for a large language model (LLM) inference serving system that supports diverse applications with both short and long prompts and heterogeneous service level objectives (SLOs) for iteration time. To improve throughput when handling long prompts, prior work introduced a chunking method but did not address heterogeneous SLOs. To overcome this limitation, we propose AccelGen, a high-throughput LLM inference serving system with heterogeneous SLO guarantees for diverse applications. AccelGen introduces three core components: (1) SLO-guaranteed dynamic chunking, which dynamically adjusts chunk sizes to maximize GPU compute utilization while meeting iteration-level SLOs; (2) iteration-level SLO-based task prioritization, which prioritizes tight-SLO requests and batches requests with similar SLOs; and (3) multi-resource-aware batching, which selects queued requests to maximize the utilization of both GPU compute and the key-value cache (KVC). Trace-driven experiments on real hardware demonstrate that AccelGen achieves 1.42-11.21X higher throughput, 1.43-13.71X higher goodput, 37-90% higher SLO attainment, and 1.61-12.22X lower response latency than state-of-the-art approaches. Its performance approaches that of an Oracle that optimally maximizes goodput.
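To make the dynamic-chunking idea concrete, below is a minimal sketch, not AccelGen's actual implementation. It assumes a hypothetical linear iteration-latency model, iter_time ≈ base_ms + per_token_ms × tokens (which would be fitted offline per GPU and model), and picks the largest prompt chunk whose predicted iteration time stays within the tightest iteration-level SLO in the batch; the function name and parameters are illustrative.

```python
# A minimal sketch of SLO-guaranteed dynamic chunking (illustrative only,
# not AccelGen's implementation). Assumes a hypothetical linear latency
# model fitted offline: iter_time(tokens) ~= base_ms + per_token_ms * tokens.

def max_chunk_size(slo_ms: float, base_ms: float, per_token_ms: float,
                   remaining_prompt: int) -> int:
    """Largest prompt chunk whose predicted iteration time fits within slo_ms."""
    budget_ms = slo_ms - base_ms
    if budget_ms <= 0:
        return 0  # SLO is tighter than the fixed per-iteration overhead
    return min(remaining_prompt, int(budget_ms // per_token_ms))

# Usage: with a 50 ms iteration SLO, 5 ms fixed overhead, and 0.02 ms/token,
# a 4096-token prompt is processed in chunks of at most 2250 tokens.
print(max_chunk_size(slo_ms=50.0, base_ms=5.0, per_token_ms=0.02,
                     remaining_prompt=4096))  # -> 2250
```

Under this sketch, a long prompt is admitted gradually, so a co-batched tight-SLO request never waits behind a full-prompt prefill; tightening the SLO simply shrinks the chunk, trading per-iteration GPU utilization for latency compliance.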