Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks. Serving LLM inference for long-content generation, however, poses a challenge due to the enormous memory footprint of the transient state, known as the key-value (KV) cache, which scales with the sequence length and batch size. In this paper, we present InfiniGen, a novel KV cache management framework tailored for long-text generation, which works synergistically with modern offloading-based inference systems. InfiniGen leverages the key insight that the few important tokens essential for computing the subsequent attention layer in the Transformer can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer. This allows us to prefetch only the essential KV cache entries, rather than fetching them all, thereby mitigating the fetch overhead from host memory in offloading-based LLM serving systems. Our evaluation on several representative LLMs shows that InfiniGen improves the overall performance of a modern offloading-based system by up to 3.00x compared to prior KV cache management methods while offering substantially better model accuracy.
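The speculation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the use of NumPy, and the plain dot-product scoring over a low-rank "partial" projection are all assumptions made for clarity. The idea is to approximate the next layer's attention scores cheaply, using the current layer's input together with a partial copy of the next layer's query weight and key cache, and keep only the top-k token indices so that only those KV entries are prefetched from host memory.

```python
import numpy as np

def speculate_important_tokens(hidden, Wq_partial, key_cache_partial, k):
    """Hypothetical sketch of the minimal rehearsal.

    hidden:            current layer's input hidden state, shape (d_model,)
    Wq_partial:        partial query weight of the *next* layer, (d_model, d_partial)
    key_cache_partial: partial key cache of the next layer, (n_tokens, d_partial)
    k:                 number of important tokens to keep

    Returns sorted indices of the k tokens with the highest
    approximate attention scores; only these KV cache entries
    would be fetched from host memory.
    """
    q_approx = hidden @ Wq_partial          # cheap approximate query, (d_partial,)
    scores = key_cache_partial @ q_approx   # one approximate score per cached token
    topk = np.argsort(scores)[-k:]          # indices of the k highest scores
    return np.sort(topk)

# Toy example: 8 cached tokens, 16-dim hidden state, 4-dim partial projection.
rng = np.random.default_rng(0)
hidden = rng.standard_normal(16)
Wq_partial = rng.standard_normal((16, 4))
key_cache_partial = rng.standard_normal((8, 4))
important = speculate_important_tokens(hidden, Wq_partial, key_cache_partial, k=3)
```

In a real offloading pipeline this selection would run one layer ahead of the attention computation, overlapping the (much smaller) host-to-GPU transfer of the selected entries with the current layer's compute.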