Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads

Large language models (LLMs) have made remarkable advances in long-context comprehension and processing tasks. However, scaling LLM inference to such long contexts incurs significant additional computation and demands a substantial GPU memory footprint to hold the key-value (KV) cache of transformer-based LLMs. Existing KV cache compression methods fall short in different ways: quantization still faces memory bottlenecks as context length increases, while approaches that keep a static-sized cache, such as eviction, suffer from inefficient policies. These limitations restrict deployment on consumer-grade devices such as a single Nvidia 4090 GPU. To overcome this, we propose Locret, a framework for long-context LLM inference that introduces retaining heads to evaluate the causal importance of KV cache units, allowing more accurate eviction within a fixed cache size. The retaining heads are fine-tuned on top of the frozen backbone LLM using a minimal amount of data from standard long-context SFT datasets. During inference, we evict low-importance cache units while prefilling the context in chunks, which significantly reduces peak GPU memory usage. We conduct an extensive empirical study to evaluate Locret. The experimental results show that Locret outperforms recent competitive approaches, including InfLLM, Quantization, SirLLM, and MInference, in terms of memory efficiency and the quality of generated content. Compared to the full KV cache, Locret achieves KV cache compression ratios of over 20x and 8x for Phi-3-mini-128K and Llama-3.1-8B-instruct, respectively. Additionally, Locret can be combined with other methods, such as quantization and token merging. To our knowledge, Locret is the first framework capable of deploying Llama-3.1-8B or similar models on a single Nvidia 4090 GPU, enabling 128K long-context inference without compromising generation quality and requiring little additional system optimization.
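To make the interaction between chunked prefill and fixed-budget eviction concrete, the following is a minimal sketch, not the Locret implementation. It assumes each cache unit receives one importance score at the moment it enters the cache. The function `chunked_prefill_with_eviction`, its parameters, and the toy `toy_score` function are hypothetical names introduced for illustration; in Locret the scores would come from the trained retaining heads rather than from a hand-written function.

```python
from typing import Callable, List, Sequence, Tuple

# A cache unit is reduced here to (token position, importance score).
# A real KV cache would also hold the key and value tensors.
CacheUnit = Tuple[int, float]


def chunked_prefill_with_eviction(
    tokens: Sequence[int],
    score_fn: Callable[[int, int], float],  # hypothetical stand-in for retaining heads
    chunk_size: int,
    budget: int,
) -> List[CacheUnit]:
    """Process `tokens` chunk by chunk, keeping at most `budget` units after each chunk."""
    cache: List[CacheUnit] = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # Score each new unit once, when it enters the cache.
        cache.extend(
            (start + offset, score_fn(start + offset, token))
            for offset, token in enumerate(chunk)
        )
        if len(cache) > budget:
            # Keep the `budget` highest-scoring units, then restore position order.
            top = sorted(cache, key=lambda unit: unit[1], reverse=True)[:budget]
            cache = sorted(top, key=lambda unit: unit[0])
    return cache


def toy_score(position: int, token: int) -> float:
    # Assumption: an arbitrary deterministic score with no ties, for demonstration only.
    return float((token * 7) % 10)


tokens = list(range(10))
kept = chunked_prefill_with_eviction(tokens, toy_score, chunk_size=4, budget=3)
kept_positions = [position for position, _ in kept]
print(kept_positions)  # [1, 4, 7]
```

In this sketch the cache never holds more than `budget + chunk_size` units at once, regardless of the total context length, which is the property that bounds peak memory during prefill.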
