Key-value (KV) cache memory management is the primary bottleneck limiting throughput and cost-efficiency in large-scale GPU inference serving. Current systems suffer from three compounding inefficiencies: (1) no unified KV cache sizing across attention architectures, in particular multi-head latent attention (MLA), which general-purpose frameworks do not support, resulting in up to 57x memory over-provisioning; (2) confinement of the KV cache to a single memory tier (GPU HBM) despite the availability of a rich hierarchy spanning CPU DRAM, CXL-attached memory, NVMe via GPUDirect Storage, RDMA fabric, and parallel filesystems; and (3) reactive eviction policies that discard reusable state, forcing redundant recomputation. We present a unified system that addresses all three problems. Our architecture-variant-aware sizing engine computes exact memory requirements per attention type, enabling up to 7.4x larger batch sizes. A six-tier memory hierarchy extends effective KV cache capacity from 40 GB to over 38 TB per node while maintaining sub-millisecond time-to-first-token (TTFT) for hot entries. A Bayesian reuse predictor with Beta conjugate priors over 16 (block-type, transition-type) pairs, combined with EMA-scored head-granular eviction and RoPE-aware prefetching, achieves 70-84% cache hit rates in component-level validation on trace replay using ShareGPT, LMSYS-Chat-1M, and agentic workloads. Analytical projections combining validated component behavior with published hardware specifications indicate a 1.4-2.1x TTFT reduction, a 1.7-2.9x throughput improvement, and a 47% cost reduction compared to state-of-the-art baselines.
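To make the sizing argument concrete, the following is a minimal sketch of per-architecture KV cache sizing, not the system's actual engine. It uses the commonly cited per-token formulas: MHA, GQA, and MQA cache a key and a value vector per KV head, while MLA caches one compressed latent plus a decoupled RoPE key per layer. The `AttnConfig` type, its field names, and the example dimensions (60 layers, 128 heads of width 128, a 512-wide latent, a 64-wide RoPE key, 2-byte elements) are illustrative assumptions, not values taken from this work.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AttnConfig:
    """Hypothetical description of one model's attention layout."""
    kind: str                 # "mha", "gqa", "mqa", or "mla"
    num_layers: int
    num_heads: int
    head_dim: int
    num_kv_heads: int = 0     # GQA only: number of shared KV heads
    kv_lora_rank: int = 0     # MLA only: width of the compressed KV latent
    rope_head_dim: int = 0    # MLA only: width of the decoupled RoPE key
    dtype_bytes: int = 2      # assumed FP16/BF16 cache entries


def kv_elems_per_token_per_layer(cfg: AttnConfig) -> int:
    """Number of cached elements one token adds to one layer."""
    if cfg.kind == "mha":
        return 2 * cfg.num_heads * cfg.head_dim      # K and V for every head
    if cfg.kind == "gqa":
        return 2 * cfg.num_kv_heads * cfg.head_dim   # K and V per KV group
    if cfg.kind == "mqa":
        return 2 * cfg.head_dim                      # a single shared K/V head
    if cfg.kind == "mla":
        return cfg.kv_lora_rank + cfg.rope_head_dim  # latent + RoPE key
    raise ValueError(f"unknown attention kind: {cfg.kind!r}")


def kv_bytes_per_token(cfg: AttnConfig) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    return kv_elems_per_token_per_layer(cfg) * cfg.num_layers * cfg.dtype_bytes


def max_cached_tokens(cfg: AttnConfig, budget_bytes: int) -> int:
    """How many tokens fit in a given cache budget."""
    return budget_bytes // kv_bytes_per_token(cfg)


# Illustrative MLA model, and the same model sized as if it were plain MHA.
mla = AttnConfig("mla", num_layers=60, num_heads=128, head_dim=128,
                 kv_lora_rank=512, rope_head_dim=64)
naive = replace(mla, kind="mha")

ratio = kv_bytes_per_token(naive) / kv_bytes_per_token(mla)
print(f"over-provisioning factor: {ratio:.1f}x")  # prints 56.9x
```

Under these assumed dimensions, sizing an MLA model with the MHA formula reserves roughly 57 times more memory per token than the latent representation needs. That is the same order as the over-provisioning figure quoted above, although the abstract does not state how that figure was derived.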
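The reuse predictor described in the abstract can be illustrated with a small Beta-Bernoulli sketch. The abstract gives only the count of 16 (block-type, transition-type) pairs, so the 4 x 4 split, the type names, the uniform Beta(1, 1) prior, and the class and method names below are all hypothetical. Each pair keeps its own Beta posterior over the probability that a cached block is reused, and the conjugate update simply increments one of the two parameters per observation.

```python
from itertools import product

# Hypothetical categories; the source specifies only that there are 16 pairs.
BLOCK_TYPES = ("system_prompt", "user_turn", "assistant_turn", "tool_output")
TRANSITION_TYPES = ("continue", "branch", "retry", "idle")


class BetaReusePredictor:
    """One Beta(alpha, beta) posterior per (block_type, transition_type) pair."""

    def __init__(self, prior_alpha: float = 1.0, prior_beta: float = 1.0):
        # Assumed uniform prior; a deployed system might fit priors from traces.
        self.params = {
            pair: [prior_alpha, prior_beta]
            for pair in product(BLOCK_TYPES, TRANSITION_TYPES)
        }

    def update(self, block_type: str, transition_type: str, reused: bool) -> None:
        """Conjugate update: a reuse adds to alpha, a miss adds to beta."""
        ab = self.params[(block_type, transition_type)]
        if reused:
            ab[0] += 1.0
        else:
            ab[1] += 1.0

    def reuse_probability(self, block_type: str, transition_type: str) -> float:
        """Posterior mean of the reuse probability, alpha / (alpha + beta)."""
        alpha, beta = self.params[(block_type, transition_type)]
        return alpha / (alpha + beta)


predictor = BetaReusePredictor()
# Three observed reuses and one miss for a system-prompt block.
for outcome in (True, True, True, False):
    predictor.update("system_prompt", "continue", outcome)

p_seen = predictor.reuse_probability("system_prompt", "continue")   # 4 / 6
p_unseen = predictor.reuse_probability("tool_output", "idle")       # prior mean
print(f"{p_seen:.3f} {p_unseen:.3f}")  # prints 0.667 0.500
```

An eviction or prefetch policy could rank blocks by this posterior mean. How the actual system combines it with EMA scoring is not specified in the abstract and is not modeled here.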