CacheGen: Fast Context Loading for Language Model Applications

Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang Huang, Shan Lu, Michael Maire, Henry Hoffmann, Ari Holtzman, Ganesh Ananthanarayanan, Junchen Jiang

As large language models (LLMs) take on more complex tasks, their inputs incorporate longer contexts to answer questions that require domain knowledge or user-specific conversational histories. Yet long contexts pose a challenge for responsive LLM systems, because nothing can be generated until the entire context has been fetched and processed by the LLM. Existing systems optimize only the computation delay in context processing (e.g., by caching the intermediate key-value (KV) features of the text context), but they often incur longer network delays in context fetching (e.g., KV features consume orders of magnitude more bandwidth than the text context). This paper presents CacheGen, which minimizes the delays in fetching and processing contexts for LLMs. CacheGen reduces the bandwidth needed to transmit a long context's KV features through a novel encoder that compresses the KV features into more compact bitstream representations. The encoder combines adaptive quantization with a tailored arithmetic coder, taking advantage of the KV features' distributional properties, such as locality across tokens. Furthermore, CacheGen minimizes the total delay in fetching and processing a context by using a controller that determines when to load the context as compressed KV features and when to load it as raw text, and that picks the appropriate compression level when the context is loaded as KV features. We test CacheGen on three models of various sizes and three datasets of different context lengths. Compared to recent methods that handle long contexts, CacheGen reduces bandwidth usage by 3.7-4.3x and the total delay in fetching and processing contexts by 2.7-3x, while maintaining LLM performance on various tasks similar to that of loading the text contexts.
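
The controller's decision described above can be illustrated with a minimal sketch. This is not CacheGen's actual implementation: the `LoadOption` fields, the `choose_option` function, the minimum-quality threshold, and every number in `OPTIONS` are hypothetical and chosen only for illustration. The sketch assumes that the total delay of an option is its transfer time plus the time the LLM needs to process it after arrival, and that the controller picks the fastest option that still meets a quality target.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoadOption:
    """One way to deliver a context to the LLM (all fields are hypothetical)."""
    name: str
    size_bytes: float       # bytes that must cross the network
    compute_seconds: float  # LLM processing time once the data has arrived
    quality: float          # assumed relative response quality in [0, 1]


def total_delay(option: LoadOption, bandwidth_bytes_per_s: float) -> float:
    """Assumed delay model: network transfer time plus processing time."""
    return option.size_bytes / bandwidth_bytes_per_s + option.compute_seconds


def choose_option(options: List[LoadOption],
                  bandwidth_bytes_per_s: float,
                  min_quality: float) -> LoadOption:
    """Pick the lowest-delay option whose quality meets the threshold."""
    eligible = [o for o in options if o.quality >= min_quality]
    if not eligible:
        raise ValueError("no option satisfies the quality threshold")
    return min(eligible, key=lambda o: total_delay(o, bandwidth_bytes_per_s))


# Illustrative numbers only, not measurements from the paper.
# Raw text is tiny to send but slow to process; KV features are the opposite.
OPTIONS = [
    LoadOption("raw_text", 40e3, 2.0, 1.00),
    LoadOption("kv_level_low_compression", 400e6, 0.1, 0.99),
    LoadOption("kv_level_high_compression", 100e6, 0.1, 0.95),
]

if __name__ == "__main__":
    for bandwidth in (10e6, 100e6, 1e9):  # bytes per second
        best = choose_option(OPTIONS, bandwidth, min_quality=0.90)
        print(f"{bandwidth:.0e} B/s -> {best.name}")
```

With these made-up numbers, a slow link favors raw text because transferring KV features dominates the delay, while a fast link favors KV features because the processing time of raw text dominates. Tightening `min_quality` steers the choice toward a less aggressive compression level.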
