Prompts to large language models (LLMs) have evolved beyond simple user questions. For LLMs to solve complex problems, today's practice is to include domain-specific instructions, illustrations of tool usage, and/or long context, such as textbook chapters, in prompts. As a result, many parts of prompts are repeated across requests. Recent work proposes caching and reusing the KV state of prompts. However, these approaches are all confined to single-GPU optimization, while production LLM serving systems are distributed by nature. This paper proposes Preble, the first distributed LLM serving platform that targets and optimizes for prompt sharing. We design a distributed scheduling system that co-optimizes KV state reuse and computation load balancing through a new scheduling algorithm and a hierarchical scheduling mechanism. Our evaluation of Preble with real workloads and request arrival patterns on two open-source LLMs shows that Preble outperforms state-of-the-art serving systems by 1.5X to 14.5X on average latency and 2X to 10X on p99 latency.
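To make the tension between KV state reuse and load balancing concrete, the following is a minimal sketch of a prefix-aware dispatcher. It is not Preble's scheduling algorithm, which the abstract does not specify. The cost model (uncached prompt tokens plus a weighted load term), the `Gpu` class, and names such as `pick_gpu`, `dispatch`, and `load_weight` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


def shared_prefix_len(a, b):
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class Gpu:
    """Toy model of one GPU (hypothetical, for illustration only)."""
    name: str
    # Prompts (as token tuples) whose KV state is assumed to be resident.
    cached_prompts: list = field(default_factory=list)
    # Crude load proxy: number of prompt tokens this GPU has been asked to compute.
    pending_tokens: int = 0

    def best_hit(self, prompt):
        """Longest prefix of `prompt` already covered by a cached prompt."""
        return max(
            (shared_prefix_len(prompt, c) for c in self.cached_prompts),
            default=0,
        )


def pick_gpu(gpus, prompt, load_weight=0.5):
    """Return the GPU with the lowest estimated cost.

    Assumed cost model: tokens that still need prefill (prompt length minus
    the cached prefix) plus `load_weight` times the GPU's current load.
    """
    def cost(gpu):
        uncached = len(prompt) - gpu.best_hit(prompt)
        return uncached + load_weight * gpu.pending_tokens

    return min(gpus, key=cost)


def dispatch(gpus, prompt):
    """Assign `prompt` to a GPU, then update that GPU's load and cache."""
    gpu = pick_gpu(gpus, prompt)
    gpu.pending_tokens += len(prompt) - gpu.best_hit(prompt)
    gpu.cached_prompts.append(tuple(prompt))
    return gpu
```

Under this assumed cost model, requests that share a long prefix with an earlier request tend to be sent to the GPU that already holds that prefix, while unrelated requests drift toward less loaded GPUs. Raising `load_weight` shifts the balance toward spreading load, and lowering it favors cache reuse.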