Many applications leverage large language models (LLMs) for complex tasks and generally demand low inference latency and high serving throughput for interactive online jobs such as chatbots. However, the tight latency requirements and high load variance of these applications make it challenging for serving systems to achieve high GPU utilization. Because scheduling and preemption are costly, today's systems generally serve online and offline inference tasks on separate clusters and dedicate GPUs to online inference to avoid interference. This approach leaves GPUs underutilized because enough GPU resources must be reserved for the expected peak load even when the average load is low. This paper proposes harvesting stranded GPU resources for offline LLM inference tasks such as document summarization and LLM benchmarking. Unlike online inference, these tasks usually run in a batch-processing manner with loose latency requirements, making them a good fit for stranded resources that are only available for short periods. To enable safe and efficient GPU harvesting without interfering with online tasks, we built ConServe, an LLM serving system comprising (1) an execution engine that preempts running offline tasks upon the arrival of online tasks, (2) an incremental checkpointing mechanism that minimizes the recomputation required after preemption, and (3) a scheduler that adaptively batches offline tasks for higher GPU utilization. Our evaluation demonstrates that ConServe achieves strong performance isolation when co-serving online and offline tasks while attaining much higher GPU utilization. When colocating practical online and offline workloads on popular models such as Llama-2-7B, ConServe achieves 2.35$\times$ higher throughput than state-of-the-art online serving systems and reduces serving latency by 84$\times$ compared to existing co-serving systems.