Multi-turn Large Language Model (LLM) serving is critical for consistent user experiences, yet the linear growth of the Key-Value (KV) cache puts significant pressure on GPU memory and bandwidth. Non-uniform KV compression preserves more information than uniform compression by accounting for the individual importance of each KV cache. However, the resulting KV cache heterogeneity introduces several system-level challenges, including memory fragmentation, scheduling complexity, and reduced kernel utilization, which together cause significant inefficiencies in existing LLM serving systems. To overcome these challenges, we present Tangram, a novel serving system designed to make non-uniform KV caches practical. Tangram addresses these inefficiencies through three core techniques: (1) Deterministic Budget Allocation assigns a static memory footprint to each head based on its intrinsic pattern, eliminating dynamic scheduling overhead and prefill stalls; (2) Head Group Page clusters attention heads with similar retention demands and manages them with independent, vectorized page tables, thereby maximizing physical memory reclamation; and (3) Ahead-of-Time (AOT) Load Balancing leverages static budget profiles to ensure uniform GPU utilization without runtime overhead. Experimental results show that Tangram improves throughput by up to 2.6x over existing baselines while fully preserving model accuracy. Our implementation is publicly available at https://github.com/aiha-lab/TANGRAM.
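To make the role of static per-head budgets concrete, the following is a minimal sketch and not Tangram's actual implementation. It assumes each attention head has a fixed token budget known before serving, groups heads with similar budgets, and assigns heads to workers ahead of time with a greedy longest-first heuristic. The function names, the sort-and-chunk grouping, the "pad each group to its largest budget" page model, and the example budgets are all illustrative assumptions.

```python
import heapq

def group_heads_by_budget(budgets, group_size):
    """Sort head indices by budget and chunk them, so each group holds
    heads with similar budgets (hypothetical grouping rule)."""
    order = sorted(range(len(budgets)), key=lambda h: budgets[h])
    return [order[i:i + group_size] for i in range(0, len(order), group_size)]

def pages_needed(budgets, groups, page_size):
    """Toy memory model (an assumption): every head in a group is padded
    to the largest budget in that group, rounded up to whole pages."""
    total = 0
    for group in groups:
        max_budget = max(budgets[h] for h in group)
        pages_per_head = -(-max_budget // page_size)  # ceiling division
        total += pages_per_head * len(group)
    return total

def balance_ahead_of_time(budgets, num_workers):
    """Greedy longest-first assignment of heads to workers. Because the
    budgets are static, this can run once, before any request arrives."""
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_workers)]
    for h in sorted(range(len(budgets)), key=lambda h: -budgets[h]):
        load, w = heapq.heappop(heap)  # least-loaded worker so far
        assignment[w].append(h)
        heapq.heappush(heap, (load + budgets[h], w))
    loads = [sum(budgets[h] for h in heads) for heads in assignment]
    return assignment, loads

# Illustrative budgets (tokens retained per head); not taken from the paper.
budgets = [1024, 128, 512, 128, 2048, 256, 256, 64]
groups = group_heads_by_budget(budgets, group_size=2)
grouped_pages = pages_needed(budgets, groups, page_size=64)
single_group_pages = pages_needed(budgets, [list(range(len(budgets)))], page_size=64)
assignment, loads = balance_ahead_of_time(budgets, num_workers=2)
```

Under this toy model, padding all eight heads to the largest budget takes 256 pages, while four budget-sorted groups take 92, and the two workers end up with loads of 2240 and 2176 tokens. The numbers only illustrate the intuition behind grouping and ahead-of-time balancing; they say nothing about Tangram's measured behavior.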