Large Language Models (LLMs) have been widely deployed in a variety of applications, and context lengths are rapidly increasing to handle tasks such as long-document QA and complex logical reasoning. However, long contexts pose significant challenges for inference efficiency, including the high memory cost of the key-value (KV) cache and increased latency due to extensive memory accesses. Recent works have proposed compressing the KV cache to approximate the attention computation, but these methods either evict tokens permanently, never recalling them for later inference, or recall previous tokens at the granularity of pages partitioned by textual position. Both approaches degrade model accuracy and output quality. To achieve efficient and accurate recallable KV cache compression, we introduce ClusterKV, which recalls tokens at the granularity of semantic clusters. We design and implement efficient algorithms and systems for clustering, selection, indexing and caching. Experimental results show that ClusterKV attains negligible accuracy loss across various tasks with 32k context lengths, using only a 1k to 2k KV cache budget, and achieves up to a 2$\times$ speedup in latency and a 2.5$\times$ improvement in decoding throughput. Compared to SoTA recallable KV compression methods, ClusterKV demonstrates higher model accuracy and output quality, while maintaining or exceeding inference efficiency.
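The core idea of recalling tokens at the granularity of semantic clusters can be illustrated with a minimal sketch: group key vectors by similarity (here a toy k-means, a hypothetical stand-in; the paper's actual clustering, indexing, and caching systems are more involved), then at decode time score cluster centroids against the current query and recall only the tokens in the best-matching clusters.

```python
import numpy as np

def cluster_keys(keys, n_clusters, n_iters=10, seed=0):
    """Toy k-means over key vectors. This is an illustrative stand-in
    for the clustering step, not ClusterKV's exact algorithm."""
    rng = np.random.default_rng(seed)
    centroids = keys[rng.choice(len(keys), n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        # assign each key to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(keys[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute centroids as the mean of their assigned keys
        for c in range(n_clusters):
            members = keys[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, labels

def recall_tokens(query, centroids, labels, top_c=2):
    """Score centroids against the query and recall all token indices
    belonging to the top-scoring clusters (cluster-granularity recall,
    rather than per-page or permanent eviction)."""
    scores = centroids @ query
    chosen = np.argsort(scores)[-top_c:]
    return np.flatnonzero(np.isin(labels, chosen))

# toy example: 16 cached tokens with 4-dim keys, grouped into 4 clusters
keys = np.random.default_rng(1).normal(size=(16, 4)).astype(np.float32)
centroids, labels = cluster_keys(keys, n_clusters=4)
query = keys[0]  # pretend the current query resembles token 0's key
recalled = recall_tokens(query, centroids, labels, top_c=2)
```

Only the recalled tokens' KV entries then participate in attention, which is what keeps the working set within a small cache budget while still allowing previously deprioritized tokens to return when they become semantically relevant.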