Although the Key-Value (KV) cache is essential for efficient large language model (LLM) inference, its growing memory footprint in long-context scenarios poses a significant bottleneck, making KVCache compression crucial. Current compression methods rely on rigid splitting strategies, such as fixed intervals or pre-defined delimiters. We observe that rigid splitting suffers significant accuracy degradation (ranging from 5.5% to 55.1%) across scenarios, owing to the scenario-dependent nature of semantic boundaries. This highlights the necessity of dynamic semantic splitting. Achieving it poses two challenges: (1) improper delimiter selection misaligns semantics with the KVCache, causing up to 28.6% accuracy loss; and (2) the variable-length blocks produced by splitting introduce over 73.1% additional inference overhead. To address these challenges, we propose DynSplit-KV, a KVCache compression method that dynamically identifies delimiters for splitting. It comprises (1) a dynamic importance-aware delimiter selection strategy that improves accuracy by 49.9%, and (2) a uniform mapping strategy that transforms variable-length semantic blocks into a fixed-length format, reducing inference overhead by 4.9x. Experiments show that DynSplit-KV achieves the highest accuracy, a 2.2x speedup over FlashAttention, and a 2.6x peak-memory reduction in long-context scenarios.