Large Language Models (LLMs) have recently seen great success, as evidenced by the widespread popularity of ChatGPT. Existing LLM serving systems are stateless across requests. Consequently, when LLMs are used in the common setting of multi-turn conversations, the serving system must process a growing log of the conversation history alongside each new request, so the same history is reprocessed at every turn. In this paper, we design $Pensieve$, a system optimized for serving LLMs in multi-turn conversations. $Pensieve$ maintains conversation state across requests by caching previously processed history, which avoids duplicate processing. Its multi-tier caching strategy uses both GPU and CPU memory to store and retrieve cached data efficiently. $Pensieve$ also generalizes the recent PagedAttention kernel to support attention between multiple input tokens with a GPU cache spread over non-contiguous memory. Our evaluation shows that $Pensieve$ achieves 1.51-1.95x the throughput of vLLM and reduces latency by 60-75%.
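The caching idea in the abstract can be illustrated with a toy model. The sketch below is not the system's implementation: it tracks only token counts per conversation instead of real attention key/value tensors, and the class name `TwoTierCache`, the helper `tokens_to_process`, the capacities, and the least-recently-used eviction policy are all assumptions made for illustration. It shows how a GPU tier backed by a CPU tier lets a serving system process only the tokens that are new since the previous turn.

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy GPU/CPU cache of per-conversation history (hypothetical design).

    Values are token counts standing in for cached key/value tensors.
    Eviction is least-recently-used, which is an assumption of this sketch.
    """

    def __init__(self, gpu_capacity: int, cpu_capacity: int) -> None:
        self.capacity = {"gpu": gpu_capacity, "cpu": cpu_capacity}
        self.tiers = {"gpu": OrderedDict(), "cpu": OrderedDict()}

    def _insert(self, tier: str, conv_id: str, tokens: int) -> None:
        entries = self.tiers[tier]
        entries[conv_id] = tokens
        entries.move_to_end(conv_id)  # mark as most recently used
        # Evict least recently used entries until the tier fits.
        while sum(entries.values()) > self.capacity[tier]:
            victim, victim_tokens = entries.popitem(last=False)
            if tier == "gpu":
                # Demote from GPU memory to CPU memory.
                self._insert("cpu", victim, victim_tokens)
            # Entries evicted from the CPU tier are dropped entirely.

    def cached_tokens(self, conv_id: str) -> int:
        """Number of history tokens already cached in either tier."""
        for entries in self.tiers.values():
            if conv_id in entries:
                return entries[conv_id]
        return 0

    def update(self, conv_id: str, tokens: int) -> None:
        """Record the conversation's full history in the GPU tier."""
        self.tiers["cpu"].pop(conv_id, None)
        self._insert("gpu", conv_id, tokens)


def tokens_to_process(cache: TwoTierCache, conv_id: str, total_tokens: int) -> int:
    """Tokens that must be computed this turn, given what is cached."""
    new_tokens = total_tokens - cache.cached_tokens(conv_id)
    cache.update(conv_id, total_tokens)
    return new_tokens


cache = TwoTierCache(gpu_capacity=100, cpu_capacity=100)
first_a = tokens_to_process(cache, "a", 60)   # nothing cached: 60 tokens
first_b = tokens_to_process(cache, "b", 60)   # "a" is demoted to the CPU tier
second_a = tokens_to_process(cache, "a", 80)  # 60 cached on CPU: 20 new tokens
print(first_a, first_b, second_a)
```

In this toy run a stateless server would process 60 + 60 + 80 = 200 tokens across the three turns, while the cached version processes 140. A real system would also have to account for the cost of moving data between CPU and GPU memory, which this sketch ignores.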