Multi-turn LLM serving accumulates dialogue history whose Key-Value (KV) cache grows with every turn and every user, quickly exceeding the model weights themselves and making memory -- not compute -- the binding constraint on throughput. Non-uniform KV compression, which allocates heterogeneous budgets across attention heads, preserves accuracy far better than uniform schemes, yet remains impractical: modern serving stacks assume identical KV lengths across heads, so heterogeneity traps freed memory as page fragmentation, spends up to 25% of prefill time reclaiming scattered pages, and skews GPU workloads, either inflating decode latency by up to $1.7\times$ or burning 15--20% of each decode step on re-planning. We observe that this heterogeneity need not be discovered at runtime: head-wise retention follows a two-level structural regularity -- an input-invariant head ranking with narrowly bounded per-head ratios -- that can be calibrated offline from as few as 50 samples. Building on this insight, we present Tangram, a serving framework that statically resolves what prior systems handle dynamically: Budget Reservation fixes each head's post-compression footprint at scheduling time, eliminating page reclamation; Ragged Paging clusters similar-budget heads into independent page tables, turning fragmentation into reclaimable memory; and Ahead-of-Time Load Balancing precomputes balanced GPU partitions with zero runtime planning. Implemented on vLLM, Tangram serves as a drop-in substrate for existing non-uniform compression methods, matching their accuracy while improving end-to-end throughput by up to $2.6\times$ over the full-KV baseline. Our implementation is publicly available at https://github.com/aiha-lab/TANGRAM.
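To make the fragmentation argument concrete, the following is a minimal sketch, not Tangram's actual algorithm: it only counts how many fixed-size pages are needed when all heads are padded to the longest head versus when heads with similar budgets share a page table. The function names, the equal-size grouping rule, the per-head budgets, and the page size of 16 tokens are all illustrative assumptions and are not taken from the paper or from vLLM.

```python
import math


def pages_uniform(budgets, page_size):
    """Pages needed when every head is padded to the longest head's KV length.

    Hypothetical model of a serving stack that assumes identical KV lengths
    across heads.
    """
    longest = max(budgets)
    return len(budgets) * math.ceil(longest / page_size)


def pages_grouped(budgets, page_size, num_groups):
    """Pages needed when heads are sorted by budget and split into groups.

    Each group is padded only to its own longest head. The equal-size split
    is an assumption made for illustration, not the paper's clustering rule.
    """
    ordered = sorted(budgets)
    group_len = math.ceil(len(ordered) / num_groups)
    total = 0
    for start in range(0, len(ordered), group_len):
        group = ordered[start:start + group_len]
        # group[-1] is the largest budget in this group because of the sort.
        total += len(group) * math.ceil(group[-1] / page_size)
    return total


# Illustrative per-head token budgets after compression (made-up values).
budgets = [64, 96, 128, 256, 512, 640, 1024, 2048]
page_size = 16

uniform = pages_uniform(budgets, page_size)
grouped = pages_grouped(budgets, page_size, num_groups=4)
exact = pages_grouped(budgets, page_size, num_groups=len(budgets))

print(f"uniform padding: {uniform} pages")
print(f"4 budget groups: {grouped} pages")
print(f"one table per head: {exact} pages")
```

With these made-up budgets, grouping recovers most of the padding that a single shared length would waste, while using far fewer page tables than one per head. That trade-off is the intuition behind clustering similar-budget heads.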