A critical technique for efficiently deploying computationally demanding large language models (LLMs) is Key-Value (KV) caching. The KV cache stores the key-value states of previously generated tokens, avoiding repeated computation and thereby lowering latency in autoregressive generation. However, the size of the KV cache grows linearly with sequence length, posing challenges for applications that require long context inputs and extensive sequence generation. In this paper, we present a simple yet effective approach, called MiniCache, that compresses the KV cache across layers from a novel depth perspective, significantly reducing the memory footprint of LLM inference. Our approach is based on the observation that KV cache states exhibit high similarity between adjacent layers in the middle-to-deep portion of LLMs. To facilitate merging, we propose disentangling the states into magnitude and direction components, interpolating the directions of the state vectors while keeping their magnitudes unchanged. Furthermore, we introduce a token retention strategy that keeps highly distinct state pairs unmerged, preserving their information with minimal additional storage overhead. MiniCache is training-free and general, complementing existing KV cache compression strategies such as quantization and sparsity. We conduct a comprehensive evaluation of MiniCache on models including LLaMA-2, LLaMA-3, Phi-3, Mistral, and Mixtral across multiple benchmarks, demonstrating superior compression ratios and high throughput. On the ShareGPT dataset, LLaMA-2-7B with 4-bit MiniCache achieves a compression ratio of up to 5.02x, improves inference throughput by approximately 5x, and reduces the memory footprint by 41% compared to the FP16 full-cache baseline, all while maintaining near-lossless performance.
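The magnitude-direction disentanglement described above can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the paper's implementation: the function names, the `(num_tokens, head_dim)` state shapes, the use of spherical interpolation (SLERP) for the directions, and the midpoint weight `t=0.5` are all assumptions made for demonstration. The idea it shows is that two adjacent layers can share one interpolated direction per token while each layer keeps only its own scalar magnitudes, from which its states can be rescaled back.

```python
import numpy as np

def merge_states(x_l, x_l1, t=0.5, eps=1e-8):
    """Merge KV states of two adjacent layers (hypothetical helper).

    x_l, x_l1: (num_tokens, head_dim) arrays of per-token states.
    Returns a shared direction per token plus each layer's magnitudes,
    so two full state matrices shrink to one matrix and two norm vectors.
    """
    # Disentangle each state into a magnitude and a unit direction.
    m_l = np.linalg.norm(x_l, axis=-1, keepdims=True)
    m_l1 = np.linalg.norm(x_l1, axis=-1, keepdims=True)
    d_l = x_l / (m_l + eps)
    d_l1 = x_l1 / (m_l1 + eps)

    # Spherically interpolate the unit directions (SLERP), which stays
    # on the unit sphere instead of shortening like linear interpolation.
    cos = np.clip(np.sum(d_l * d_l1, axis=-1, keepdims=True), -1.0, 1.0)
    omega = np.arccos(cos)
    so = np.sin(omega)
    safe_so = np.where(so > eps, so, 1.0)  # avoid division by zero
    shared_dir = np.where(
        so > eps,
        (np.sin((1.0 - t) * omega) / safe_so) * d_l
        + (np.sin(t * omega) / safe_so) * d_l1,
        # Nearly parallel directions: fall back to linear interpolation.
        (1.0 - t) * d_l + t * d_l1,
    )
    return shared_dir, m_l, m_l1

def restore(shared_dir, magnitude):
    """Approximately rebuild one layer's states from the shared
    direction and that layer's stored magnitudes."""
    return shared_dir * magnitude
```

When the two layers' directions coincide, the merge is lossless: rescaling the shared direction by a layer's magnitudes reproduces its states exactly, which is why high cross-layer similarity makes this compression near-lossless. Highly dissimilar pairs are exactly the ones the token retention strategy would keep unmerged.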