Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving

Multi-turn LLM serving accumulates dialogue history whose Key-Value (KV) cache grows with every turn and every user, quickly exceeding the size of the model weights themselves and making memory, not compute, the binding constraint on throughput. Non-uniform KV compression, which allocates heterogeneous budgets across attention heads, preserves accuracy far better than uniform schemes, yet remains impractical: modern serving stacks assume identical KV lengths across heads, so heterogeneity traps freed memory as page fragmentation, spends up to 25% of prefill time reclaiming scattered pages, and skews GPU workloads, which either inflates decode latency by up to $1.7\times$ or burns 15--20% of each decode step on re-planning. We observe that this heterogeneity need not be discovered at runtime: head-wise retention follows a two-level structural regularity (an input-invariant head ranking with narrowly bounded per-head ratios) that can be calibrated offline from as few as 50 samples. Building on this insight, we present Tangram, a serving framework that statically resolves what prior systems handle dynamically: Budget Reservation fixes each head's post-compression footprint at scheduling time, eliminating page reclamation; Ragged Paging clusters similar-budget heads into independent page tables, turning fragmentation into reclaimable memory; and Ahead-of-Time Load Balancing precomputes balanced GPU partitions with zero runtime planning. Implemented on vLLM, Tangram serves as a drop-in substrate for existing non-uniform compression methods, matching their accuracy while improving end-to-end throughput by up to $2.6\times$ over the full-KV baseline. Our implementation is publicly available at https://github.com/aiha-lab/TANGRAM.
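
To make the fragmentation argument concrete, the following is a minimal sketch of how grouping similar-budget heads might reduce page usage. It is not Tangram's implementation: the page size, the per-head retention ratios, and the function names `head_budgets`, `pages_uniform`, and `pages_grouped` are all hypothetical, chosen only for illustration. The sketch assumes per-head budgets are known before prefill (as Budget Reservation implies) and compares one shared page table, where every head is padded to the largest budget, against several independent page tables over budget-sorted head groups.

```python
import math

PAGE_SIZE = 16  # tokens per KV page; an assumed value, not taken from the paper


def head_budgets(seq_len, ratios):
    """Tokens each head keeps after compression, fixed before prefill.

    `ratios` stands in for offline-calibrated per-head retention ratios
    (hypothetical numbers in the example below).
    """
    return [max(1, math.ceil(seq_len * r)) for r in ratios]


def pages_uniform(budgets, page_size=PAGE_SIZE):
    """One shared page table: every head is padded to the largest budget."""
    return len(budgets) * math.ceil(max(budgets) / page_size)


def pages_grouped(budgets, num_groups, page_size=PAGE_SIZE):
    """Sort heads by budget, split them into contiguous groups, and give
    each group its own page table padded to that group's largest budget."""
    ordered = sorted(budgets, reverse=True)
    group_size = math.ceil(len(ordered) / num_groups)
    total = 0
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        # group[0] is the largest budget in this group after sorting.
        total += len(group) * math.ceil(group[0] / page_size)
    return total


# Hypothetical retention ratios for 8 heads and a 1024-token history.
ratios = [0.9, 0.8, 0.5, 0.4, 0.2, 0.1, 0.1, 0.05]
budgets = head_budgets(1024, ratios)

print(pages_uniform(budgets))     # 464 pages with a single page table
print(pages_grouped(budgets, 4))  # 220 pages with four page tables
print(pages_grouped(budgets, 8))  # 199 pages with one table per head
```

In this toy setting, four groups already recover most of the gap between the shared table (464 pages) and fully per-head tables (199 pages), which illustrates why a small number of page tables over similar-budget heads can be enough. The actual grouping policy and page layout used by Tangram may differ.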
