Practical large language model (LLM) services may involve a long system prompt, which specifies the instructions, examples, and knowledge documents of the task and is reused across numerous requests. However, a long system prompt causes throughput and latency bottlenecks, because the cost of generating the next token grows with the sequence length. This paper aims to improve the efficiency of LLM services that involve long system prompts. Our key observation is that handling these system prompts requires heavily redundant memory accesses in existing causal attention computation algorithms. Specifically, for batched requests, the cached hidden states (i.e., key-value pairs) of the system prompt are transferred from off-chip DRAM to on-chip SRAM multiple times, once for each individual request. To eliminate this redundancy, we propose RelayAttention, an attention algorithm that reads these hidden states from DRAM exactly once for a batch of input tokens. RelayAttention is a free lunch: it maintains the generation quality while requiring no model retraining, as it is based on a mathematical reformulation of causal attention. Code is available at \url{https://github.com/rayleizhu/vllm-ra}.
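The abstract does not spell out the reformulation, so the following is only a minimal sketch of one standard identity that makes such a split possible, not the released implementation. It assumes a single decoding step (the newest token attends to every earlier token, so no causal mask is needed) and shows that attention over the concatenated system-prompt and request-specific key-value pairs equals a weighted combination of two partial attentions, with weights derived from their log-sum-exp values. The names `attend`, `fuse`, and `rand_matrix` are hypothetical and chosen for illustration.

```python
import math
import random


def attend(q, keys, values):
    """Softmax attention of one query over (keys, values).

    Returns the attention output and the log-sum-exp of the scaled scores.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)


def fuse(out_sys, lse_sys, out_ctx, lse_ctx):
    """Combine two partial attention outputs into the full-sequence output."""
    # Share of the softmax mass held by the system-prompt part:
    # exp(lse_sys) / (exp(lse_sys) + exp(lse_ctx)).
    alpha = 1.0 / (1.0 + math.exp(lse_ctx - lse_sys))
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(out_sys, out_ctx)]


def rand_matrix(rng, rows, cols):
    """Hypothetical helper: a rows x cols matrix of Gaussian samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, sys_len, batch = 8, 16, 3

# Key-value pairs of the system prompt, shared by every request in the batch.
sys_k = rand_matrix(rng, sys_len, dim)
sys_v = rand_matrix(rng, sys_len, dim)

max_err = 0.0
for _ in range(batch):
    # Request-specific key-value pairs and the query of the newest token.
    ctx_len = rng.randint(1, 5)
    ctx_k = rand_matrix(rng, ctx_len, dim)
    ctx_v = rand_matrix(rng, ctx_len, dim)
    q = rand_matrix(rng, 1, dim)[0]

    # Reference: ordinary attention over the concatenated sequence.
    full, _ = attend(q, sys_k + ctx_k, sys_v + ctx_v)

    # Split computation: system part and request part, then fusion.
    out_sys, lse_sys = attend(q, sys_k, sys_v)
    out_ctx, lse_ctx = attend(q, ctx_k, ctx_v)
    fused = fuse(out_sys, lse_sys, out_ctx, lse_ctx)

    max_err = max(max_err, max(abs(a - b) for a, b in zip(full, fused)))

print(max_err < 1e-9)
```

The loop above only checks the identity numerically. In a batched kernel, the system-prompt part would presumably be computed for all queries of the batch in one pass over the shared key-value pairs, which is what would allow them to be read from DRAM once rather than once per request.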