KV cache compression is essential for reducing the memory cost of long-context large language model inference. Existing approaches, however, typically apply a single compression policy and a uniform cache budget across all transformer layers. This uniform design ignores the fact that different layers can play different roles during prefill and decoding, and may therefore require different eviction strategies and cache capacities. We present PolyKV, a layer-wise KV cache optimization framework that treats method selection and budget allocation as a joint design space. PolyKV routes each layer to a suitable KV compression policy based on layer-level signals, and assigns non-uniform per-layer budgets under a fixed total budget. This formulation enables heterogeneous compositions of existing KV cache methods. Experiments on LLaMA-3.1-8B and Qwen3-8B show that, under the same average KV budget of 512 tokens, PolyKV recovers 54.5% and 25.7%, respectively, of the LongBench performance gap between the strongest single-policy baseline and FullKV. Across a budget sweep from 128 to 1024 tokens, PolyKV consistently improves over the strongest baseline by 1.7% to 6.4%, corresponding to a recovery of 40.0% to 54.5% of the FullKV gap.
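The two ideas in the abstract that are easiest to misread are the fixed-total budget constraint and the gap-recovery metric, so a minimal Python sketch of both may help. It is not PolyKV's algorithm: the abstract does not specify the layer-level signals or the allocation rule, so the proportional split, the per-layer `scores`, the `min_budget` floor, and the function names `allocate_budgets` and `gap_recovery` are all assumptions made for illustration. The only properties taken from the text are that per-layer budgets may differ while their average stays fixed, and that recovery is measured relative to the gap between the strongest single-policy baseline and FullKV.

```python
from typing import List


def allocate_budgets(scores: List[float], avg_budget: int, min_budget: int = 16) -> List[int]:
    """Split avg_budget * num_layers tokens across layers in proportion to scores.

    Hypothetical rule for illustration only; the scores stand in for
    whatever layer-level signal a real system would compute.
    """
    n = len(scores)
    total = avg_budget * n
    if n == 0 or any(s < 0 for s in scores):
        raise ValueError("need at least one layer and non-negative scores")
    if min_budget * n > total:
        raise ValueError("min_budget exceeds the average budget")

    spare = total - min_budget * n          # tokens left after the per-layer floor
    score_sum = sum(scores)
    shares = [s / score_sum for s in scores] if score_sum > 0 else [1.0 / n] * n
    budgets = [min_budget + int(spare * share) for share in shares]

    # Rounding down can leave a few tokens unassigned; give them to the
    # highest-scoring layers so the total matches the fixed budget exactly.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    leftover = total - sum(budgets)
    for k in range(max(leftover, 0)):
        budgets[order[k % n]] += 1
    return budgets


def gap_recovery(baseline: float, method: float, full: float) -> float:
    """Fraction of the baseline-to-FullKV gap that the method closes."""
    gap = full - baseline
    if gap <= 0:
        raise ValueError("FullKV score must exceed the baseline score")
    return (method - baseline) / gap


if __name__ == "__main__":
    # Toy inputs, not values from the paper.
    budgets = allocate_budgets([1.0, 3.0, 2.0, 2.0], avg_budget=512)
    print(budgets, sum(budgets) / len(budgets))
    print(gap_recovery(baseline=40.0, method=45.0, full=50.0))
```

With the toy scores above, the four layers receive different budgets whose mean is still 512 tokens, which is the sense in which a comparison "under the same average KV budget" remains fair. The second function shows why a modest absolute improvement can correspond to a large recovery percentage when the baseline is already close to FullKV.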