As Large Language Models (LLMs) continue to grow, reducing costs and alleviating GPU demands has become increasingly critical. However, existing schedulers primarily target either GPU compute or Key-Value Cache (KVC) utilization, failing to fully optimize both resources in each iteration or to guarantee timely KVC allocations when needed. To address these challenges, we conducted a trace-based experimental analysis and made insightful observations, which led to the design of a system called EcoServe. EcoServe maximizes multi-resource utilization while ensuring service-level objective (SLO) guarantees in LLM serving. To allow prompts to be added to a batch so that GPU utilization is maximized in each iteration, EcoServe maintains separate waiting queues for prompt-processing tasks (PTs) and generation tasks (GTs). It batches GTs with the same predicted response length (RL) to save scheduling time and allocates KVC space for the predicted RL to avoid KVC allocation failures. EcoServe further employs a novel KVC pipelining method that shares allocated but not-yet-used KVC space to enhance KVC utilization. In addition, it prioritizes queued requests that occupy more KVC, so that KVC is released earlier and request SLOs are satisfied. Experimental results demonstrate that EcoServe increases throughput by up to 4$\times$ at the same latency level, reduces job completion time by up to 91\% and raises the SLO satisfaction ratio by up to 91\% compared with vLLM. It also reduces the number of GPUs used by DistServe by up to 78\% while maintaining the same level of goodput.