Large language models (LLMs) demonstrate exceptional performance but incur high serving costs due to substantial memory demands, with the key-value (KV) cache being a primary bottleneck. Existing KV cache compression methods, including quantization and pruning, suffer from limitations such as treating keys and values uniformly and allocating memory statically across attention heads. To address these challenges, we introduce LeanKV, a unified KV cache compression framework that improves LLM serving efficiency without compromising accuracy, through three innovations: (1) Hetero-KV quantization, which stores keys at higher precision than values to reflect their greater impact on attention computations; (2) per-head dynamic sparsity, which allocates memory based on token importance per head and per request; and (3) unified KV compression, which integrates mixed-precision quantization and selective pruning to enable a smooth tradeoff between model accuracy and memory efficiency. To support these techniques efficiently, LeanKV introduces system optimizations including unified paging and on-GPU parallel memory management. Implemented on vLLM, LeanKV compresses the KV cache by $3.0\times$ to $5.0\times$ without accuracy loss, and by up to $11.0\times$ with under 5% accuracy loss, improving throughput by $1.9\times$ to $2.5\times$, and by up to $6.9\times$.
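The intuition behind Hetero-KV quantization can be sketched with a toy experiment: quantizing keys at 8 bits and values at 4 bits yields a much smaller reconstruction error on the keys, which is where the abstract argues precision matters most for attention scores. The sketch below is illustrative only; the `quantize` helper is a hypothetical stand-in (simple symmetric per-token uniform quantization), not LeanKV's actual quantizer.

```python
import numpy as np

def quantize(x, bits):
    # Hypothetical helper: symmetric per-token uniform quantization.
    # Rounds each row to integers in [-(2^(bits-1)-1), 2^(bits-1)-1],
    # then dequantizes back to floats for error measurement.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale[scale == 0] = 1.0  # avoid division by zero on all-zero rows
    q = np.round(x / scale).clip(-qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
keys = rng.standard_normal((16, 64))    # [num_tokens, head_dim]
values = rng.standard_normal((16, 64))

# Hetero-KV idea: keep keys at higher precision (8-bit) than values (4-bit).
k_hat = quantize(keys, bits=8)
v_hat = quantize(values, bits=4)

k_err = np.abs(keys - k_hat).mean()
v_err = np.abs(values - v_hat).mean()
print(f"mean abs error: keys (8-bit) {k_err:.4f} < values (4-bit) {v_err:.4f}")
```

With the same data distribution, the 4-bit path has a quantization step roughly $16\times$ coarser than the 8-bit path, so spending the extra bits on keys buys accuracy exactly where attention computations are most sensitive, while values absorb the aggressive compression.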