Key-value (KV) caching has become the de facto technique for accelerating generation in large language model (LLM) inference. However, the growing cache demand with increasing sequence length has turned LLM inference into a memory-bound problem, significantly constraining system throughput. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. Such methods, however, often incur high approximation errors when representing the compressed matrices. The autoregressive decoding process further compounds the error at each step, resulting in critical deviations in model generation and deterioration of performance. To tackle this challenge, we propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression. GEAR first quantizes the majority of entries, which have similar magnitudes, to ultra-low precision. It then employs a low-rank matrix to approximate the quantization error and a sparse matrix to remedy individual errors from outlier entries. By adeptly integrating the three techniques, GEAR fully exploits their synergistic potential. Our experiments demonstrate that, compared to alternatives, GEAR achieves near-lossless 4-bit KV cache compression with up to 2.38x throughput improvement, while reducing peak memory by up to 2.29x. Our code is publicly available at https://github.com/HaoKang-Timmy/GEAR.
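To make the three-part decomposition concrete, below is a minimal sketch of the idea described above: split a cache matrix into a sparse outlier part, an ultra-low-precision quantized dense part, and a low-rank correction of the residual quantization error. This is an illustrative assumption of how the pieces fit together, not the actual GEAR implementation; the function names (gear_compress, gear_decompress) and parameters such as outlier_ratio are hypothetical.

```python
import torch

def gear_compress(X, bits=4, rank=2, outlier_ratio=0.01):
    """Illustrative sketch (not the official GEAR code): decompose X into
    quantized dense + low-rank error + sparse outlier components."""
    # 1. Pull out the largest-magnitude entries into a sparse outlier matrix S,
    #    so the remaining dense part has entries of similar magnitude.
    k = max(1, int(outlier_ratio * X.numel()))
    thresh = X.abs().flatten().topk(k).values.min()
    mask = X.abs() >= thresh
    S = torch.where(mask, X, torch.zeros_like(X))
    D = X - S

    # 2. Uniform per-tensor quantization of the dense part to ultra-low precision.
    lo, hi = D.min(), D.max()
    scale = (hi - lo) / (2 ** bits - 1)
    Q = torch.round((D - lo) / scale)      # integer codes
    D_hat = Q * scale + lo                 # dequantized approximation

    # 3. Low-rank approximation of the residual quantization error via truncated SVD.
    R = D - D_hat
    U, sigma, Vh = torch.linalg.svd(R, full_matrices=False)
    L = U[:, :rank] @ torch.diag(sigma[:rank]) @ Vh[:rank, :]

    return Q, scale, lo, L, S

def gear_decompress(Q, scale, lo, L, S):
    # Reconstruct: dequantized dense part + low-rank error correction + sparse outliers.
    return Q * scale + lo + L + S
```

Under this sketch, the reconstruction error comes only from the part of the quantization residual that is neither captured by the rank-`rank` term nor by the sparse outliers, which is why combining the three components can yield much lower error than uniform quantization alone.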