Transformer-based large language models (LLMs) are now deployed to hundreds of millions of users. LLM inference is commonly performed on batches of sequences that share a prefix, such as few-shot examples or a chatbot system prompt. Decoding in this large-batch setting can be bottlenecked by the attention operation, which reads large key-value (KV) caches from memory and computes inefficient matrix-vector products for every sequence in the batch. In this work, we introduce Hydragen, a hardware-aware exact implementation of attention with shared prefixes. Hydragen computes attention over the shared prefix and unique suffixes separately. This decomposition enables efficient prefix attention by batching queries together across sequences, reducing redundant memory reads and enabling the use of hardware-friendly matrix multiplications. Our method can improve end-to-end CodeLlama-13b throughput by up to 32x against competitive baselines, with speedup growing with the batch size and shared prefix length. Hydragen also enables the use of very long shared contexts: with a large batch size, increasing the prefix length from 1K to 16K tokens decreases Hydragen throughput by less than 15%, while the throughput of baselines drops by over 90%. Hydragen generalizes beyond simple prefix-suffix decomposition and can be applied to tree-based prompt sharing patterns, allowing us to further reduce inference time on competitive programming problems by 55%.
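The prefix-suffix decomposition described above can be illustrated with a small sketch. The abstract does not spell out how the two partial attention results are recombined, so the version below assumes a standard technique: each partial attention call also returns the log-sum-exp of its scores, and the two outputs are merged by reweighting with those softmax denominators. All function names (`attention_with_lse`, `combine`, `shared_prefix_attention`) are hypothetical, and the pure-Python loops only demonstrate that the decomposition is exact; they are not the hardware-aware implementation, in which the prefix step would be batched across sequences as a single matrix-matrix multiplication.

```python
import math
import random


def attention_with_lse(q, keys, values):
    """Softmax attention for one query; also returns the log-sum-exp of the scores."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)


def combine(out_a, lse_a, out_b, lse_b):
    """Merge two partial attention outputs, weighting each by its softmax denominator."""
    m = max(lse_a, lse_b)
    wa = math.exp(lse_a - m)
    wb = math.exp(lse_b - m)
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]


def shared_prefix_attention(queries, prefix_k, prefix_v, suffix_ks, suffix_vs):
    """Attend to the shared prefix and to each sequence's own suffix separately."""
    results = []
    for q, sk, sv in zip(queries, suffix_ks, suffix_vs):
        # Assumption: in a real kernel this prefix step would be computed for
        # all queries at once, since every sequence reads the same prefix KV.
        p_out, p_lse = attention_with_lse(q, prefix_k, prefix_v)
        s_out, s_lse = attention_with_lse(q, sk, sv)
        results.append(combine(p_out, p_lse, s_out, s_lse))
    return results


def naive_attention(queries, prefix_k, prefix_v, suffix_ks, suffix_vs):
    """Baseline: every sequence attends over its own full (prefix + suffix) KV cache."""
    return [
        attention_with_lse(q, prefix_k + sk, prefix_v + sv)[0]
        for q, sk, sv in zip(queries, suffix_ks, suffix_vs)
    ]


# Toy data: 3 sequences sharing a 6-token prefix, with suffixes of length 1, 2, 3.
rng = random.Random(0)
dim = 4


def rand_rows(n):
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]


prefix_k, prefix_v = rand_rows(6), rand_rows(6)
suffix_ks = [rand_rows(n) for n in (1, 2, 3)]
suffix_vs = [rand_rows(n) for n in (1, 2, 3)]
queries = rand_rows(3)

fast = shared_prefix_attention(queries, prefix_k, prefix_v, suffix_ks, suffix_vs)
slow = naive_attention(queries, prefix_k, prefix_v, suffix_ks, suffix_vs)
max_diff = max(abs(a - b) for fa, fb in zip(fast, slow) for a, b in zip(fa, fb))
print("max abs difference:", max_diff)
```

The two paths should agree up to floating-point rounding, which is what "exact" means here: the decomposition changes how the computation is scheduled, not its result.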