The linear growth of the Key-Value (KV) cache with dialogue turns remains a major bottleneck for multi-turn LLM deployment. Existing KV cache compression methods often fail to account for the structural properties of multi-turn dialogues, relying on heuristic eviction that risks discarding critical context. We propose \textbf{SONIC}, a learning-based framework that compresses historical dialogue segments into compact, semantically rich \textbf{Nexus} tokens. By incorporating dynamic budget training, SONIC adapts flexibly to varying memory constraints without retraining. Experiments show that at compression ratios of 80\% and 50\%, SONIC consistently outperforms baselines such as H2O and StreamingLLM on four diverse multi-turn benchmarks. In particular, on the widely used MTBench101 benchmark, SONIC achieves an average score improvement of 35.55\% over state-of-the-art baselines, validating its effectiveness in sustaining coherent multi-turn dialogues. Furthermore, SONIC improves deployment efficiency, accelerating overall inference by 50.1\% compared to full-context generation.