Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications by leveraging increased model sizes and sequence lengths. However, the associated rise in computational and memory costs poses significant challenges, particularly for long sequences, due to the quadratic complexity of the transformer attention mechanism. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. Unlike existing approaches that optimize memory along the sequence-length dimension, we uncover that the channel dimension of the KV cache exhibits significant redundancy, characterized by an unbalanced magnitude distribution and a low-rank structure in the attention weights. Based on these observations, we propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels. Our approach not only maintains or enhances model accuracy but also reduces memory costs by over 20% compared with vanilla KV cache eviction methods. Extensive evaluations on the LLaMA3 and Mistral models across various long-sequence datasets confirm the efficacy of ThinK, setting a new precedent for efficient LLM deployment without compromising performance. We also outline the potential of extending our method to value cache pruning, demonstrating ThinK's versatility and broad applicability in reducing both memory and computational overheads.
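The core idea of query-dependent channel pruning can be illustrated with a minimal sketch: score each channel of the key cache using the query, then retain only the highest-scoring channels. The scoring rule below (query magnitude times channel norm of the key cache) and the function name `prune_key_channels` are simplified assumptions for illustration, not the paper's exact criterion.

```python
import numpy as np

def prune_key_channels(K, q, keep_ratio=0.8):
    """Query-dependent channel pruning sketch (hypothetical helper).

    K: key cache of shape (seq_len, d_head); q: query of shape (d_head,).
    Scores each channel by |q_c| * ||K[:, c]||_2 -- a simplified
    illustration of query-aware channel importance -- and keeps the
    top `keep_ratio` fraction of channels.
    """
    d = K.shape[1]
    k = max(1, int(d * keep_ratio))
    # Channel score: query magnitude weighted by the channel's key-cache norm.
    scores = np.abs(q) * np.linalg.norm(K, axis=0)
    # Indices of the k highest-scoring channels, kept in original order.
    kept = np.sort(np.argsort(scores)[-k:])
    return K[:, kept], kept

# Toy usage: prune a key cache of 4 tokens x 8 channels to half its width.
rng = np.random.default_rng(0)
K = rng.normal(size=(4, 8))
q = rng.normal(size=8)
K_pruned, kept = prune_key_channels(K, q, keep_ratio=0.5)
```

Pruning channels (rather than evicting tokens) is what lets this approach compose with sequence-length eviction methods: the two operate on orthogonal dimensions of the cache.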