Context lengths of Large Language Models (LLMs) have exploded in recent years, with 128k-token contexts becoming standard and million-token contexts becoming a reality. Efficiently supporting long-context inference remains challenging because the memory that must be allocated in the key-value (KV) cache for a generation scales with its context length, limiting the number of long-context requests that can be served concurrently under a given memory budget. KV cache compression mitigates this issue by removing under-utilized KVs from each attention head's cache, reducing its memory footprint. Higher theoretical compression rates can be achieved when the number of removed KVs varies across attention heads, but applying such a strategy within existing inference frameworks introduces fragmentation, so the theoretical compression rate cannot be realized in physical memory. We introduce KV-Compress, a novel compression method that evicts contiguous KV blocks within a PagedAttention framework, reducing the memory footprint of the KV cache proportionally to this theoretical compression rate. Our method achieves state-of-the-art performance on LongBench for both Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct while lowering the total number of compressed KVs by 4x compared with prior methods. Evaluations on Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct-FP8 achieve compression rates of up to 8x with negligible impact on performance, and up to 64x while retaining over 90% of full-cache performance on all but three of the suite's subsets. We benchmark an integration of our method with vLLM that increases total throughput by up to 5.18x by enabling larger decoding batches.
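The core idea behind the memory savings — evicting KVs at a different rate for each attention head while reclaiming memory in whole paged blocks — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: `evict_per_head`, the score-based selection, and the `BLOCK_SIZE` value are illustrative assumptions, and real systems evict inside the paged allocator rather than by copying arrays.

```python
import numpy as np

BLOCK_SIZE = 4  # tokens per paged cache block (illustrative, not vLLM's default)

def evict_per_head(keys, values, scores, keep_counts):
    """Variable-rate per-head KV eviction with block-granular reclamation.

    keys, values: [num_heads, seq_len, head_dim] cached tensors
    scores:       [num_heads, seq_len] eviction metric (higher = more useful)
    keep_counts:  KVs to retain for each head (may differ across heads)

    Returns the compacted per-head caches and how many BLOCK_SIZE-token
    blocks each head occupies afterward; the rest are freed to the pool.
    """
    num_heads = keys.shape[0]
    kept_keys, kept_values, blocks_used = [], [], []
    for h in range(num_heads):
        keep = keep_counts[h]
        # retain the top-`keep` KVs by score, preserving original token order
        idx = np.sort(np.argsort(scores[h])[-keep:])
        kept_keys.append(keys[h, idx])
        kept_values.append(values[h, idx])
        # after compaction the head occupies ceil(keep / BLOCK_SIZE) blocks
        blocks_used.append(-(-keep // BLOCK_SIZE))
    return kept_keys, kept_values, blocks_used
```

Because retained KVs are compacted into contiguous blocks, the physical footprint shrinks in proportion to the per-head compression rate instead of being gated by the least-compressed head.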