Key-value (KV) caching has become the de facto technique for accelerating generation in large language model (LLM) inference. However, the growing cache demand with increasing sequence length has turned LLM inference into a memory-bound problem, significantly constraining system throughput. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. Such methods, however, often incur high approximation errors when representing the compressed matrices. The autoregressive decoding process further compounds the error at each step, resulting in critical deviations in model generation and deterioration of performance. To tackle this challenge, we propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression. GEAR first quantizes the majority of entries, those of similar magnitudes, to ultra-low precision. It then employs a low-rank matrix to approximate the quantization error and a sparse matrix to remedy the individual errors from outlier entries. By adeptly integrating the three techniques, GEAR fully exploits their synergistic potential. Our experiments demonstrate that, compared to alternatives, GEAR achieves near-lossless 4-bit KV cache compression with up to 2.38x throughput improvement, while reducing peak memory by up to 2.29x. Our code is publicly available at https://github.com/HaoKang-Timmy/GEAR.
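The three-way decomposition described above (low-precision quantization, plus a low-rank matrix for the quantization residual, plus a sparse matrix for outlier errors) can be illustrated with a minimal sketch. This is not the paper's exact algorithm: the function name, the ordering of the steps, and the hyperparameters (`bits`, `rank`, `sparsity`) are illustrative assumptions, and a real implementation would operate per-token or per-channel on the KV tensors.

```python
import numpy as np

def gear_like_compress(X, bits=4, rank=2, sparsity=0.01):
    """Sketch of a GEAR-style decomposition: X ~= Xq + L + S.

    Xq: uniformly quantized matrix at ultra-low precision,
    L:  low-rank approximation of the quantization residual,
    S:  sparse matrix correcting the largest (outlier) residuals.
    """
    # 1) Uniform quantization of all entries to `bits` precision.
    lo, hi = X.min(), X.max()
    scale = (hi - lo) / (2 ** bits - 1)
    Xq = np.round((X - lo) / scale) * scale + lo  # dequantized values

    # 2) Residual error left by quantization.
    R = X - Xq

    # 3) Sparse correction: keep the largest-magnitude residual entries.
    k = max(1, int(sparsity * R.size))
    thresh = np.partition(np.abs(R).ravel(), -k)[-k]
    S = np.where(np.abs(R) >= thresh, R, 0.0)

    # 4) Low-rank approximation of the remaining residual via truncated SVD.
    U, s, Vt = np.linalg.svd(R - S, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

    return Xq, L, S
```

Reconstructing `Xq + L + S` should approximate `X` more closely than the quantized matrix alone, since `S` removes the worst outlier errors exactly and `L` is the best rank-`rank` Frobenius-norm approximation of what remains.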