Long-context inference is increasingly constrained by the KV cache: resident memory grows with context length, and decoding becomes limited by repeated high-bandwidth memory (HBM) streaming rather than by arithmetic. Existing methods such as eviction, windowing, quantization, and offloading reduce the footprint but often address the critical-path bottleneck only partially, especially when compressed states must still be reconstructed into dense vectors during decoding. We present Spherical KV, a long-context inference method that treats KV allocation as a rate-distortion problem grounded in attention geometry, with the goal of efficient decoding. The method is built on two ideas: (i) represent directional information cheaply in the decode hot loop, and (ii) allocate retention and precision according to estimated future utility. Its first component, Angle-Domain Attention (ADA), stores keys in a spherical parameterization consisting of a scalar radius and compact angle codes, and computes attention logits directly from these codes without reconstructing dense keys. This preserves a paged, block-local, fusion-friendly decode path and directly targets HBM traffic in realistic serving settings. Its second component, Rate-Distortion Retention (RDR), jointly chooses keep/drop decisions and precision tiers per token and head under a fixed budget, producing tier-homogeneous pages with lightweight metadata and coalesced reads. Together, ADA and RDR provide a deployment-oriented mechanism for reducing KV residency while preserving decode efficiency.
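To make the ADA idea concrete, the following is a minimal sketch of computing a logit directly from a radius and angle codes. The abstract does not specify the parameterization, so everything here is an assumption for illustration: standard hyperspherical coordinates, uniformly quantized angles with a hypothetical code width `ANGLE_BITS`, shared cos/sin lookup tables, and the hypothetical function names `encode_key`, `logit_from_codes`, and `decode_key`. The usual 1/sqrt(d) attention scaling is omitted.

```python
import math

ANGLE_BITS = 12                    # assumed code width; the source does not give one
LEVELS = 1 << ANGLE_BITS
STEP = 2.0 * math.pi / LEVELS
# Per-code cos/sin tables, built once and shared by every key.
COS_LUT = [math.cos(c * STEP) for c in range(LEVELS)]
SIN_LUT = [math.sin(c * STEP) for c in range(LEVELS)]


def encode_key(key):
    """Map a dense key (d >= 2) to (radius, angle_codes) with d - 1 codes."""
    d = len(key)
    radius = math.sqrt(sum(x * x for x in key))
    codes = []
    for i in range(d - 1):
        if i < d - 2:
            tail = math.sqrt(sum(x * x for x in key[i + 1:]))
            phi = math.atan2(tail, key[i])            # polar angle in [0, pi]
        else:
            phi = math.atan2(key[d - 1], key[d - 2])  # last angle keeps the sign
        codes.append(round((phi % (2.0 * math.pi)) / STEP) % LEVELS)
    return radius, codes


def logit_from_codes(query, radius, codes):
    """Compute q . k from the stored codes via a Horner-style recurrence.

    No dense key is materialized: each step uses one cos/sin table lookup.
    """
    acc = query[-1]
    for i in range(len(codes) - 1, -1, -1):
        acc = query[i] * COS_LUT[codes[i]] + SIN_LUT[codes[i]] * acc
    return radius * acc


def decode_key(radius, codes):
    """Reference-only dense reconstruction, used just to check the logit."""
    key, scale = [], radius
    for c in codes:
        key.append(scale * COS_LUT[c])
        scale *= SIN_LUT[c]
    key.append(scale)
    return key
```

In this sketch, `logit_from_codes(q, *encode_key(k))` agrees with the dot product of `q` and the decoded key up to floating-point rounding, and approaches the dense `q . k` as `ANGLE_BITS` grows. A real decode kernel would presumably fuse the recurrence over a page of keys; that part is not shown.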
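The joint keep/drop and precision-tier choice in RDR can be illustrated with a small budgeted allocation sketch. This is not the paper's algorithm, which the abstract does not detail. It is a standard greedy marginal-gain heuristic for a rate-distortion budget, with a hypothetical tier table `TIERS`, hypothetical per-entry utility scores, and hypothetical function names. Each list entry stands for one (token, head) pair.

```python
import heapq

# Hypothetical tiers: (bits per entry, relative distortion). Tier 0 means "drop".
TIERS = [(0, 1.0), (2, 0.30), (4, 0.08), (8, 0.01)]


def allocate_tiers(utilities, budget_bits):
    """Greedy rate-distortion allocation: return one tier index per entry."""
    tiers = [0] * len(utilities)
    spent = 0
    heap = []  # max-heap on utility-weighted distortion reduction per bit

    def push_upgrade(i):
        t = tiers[i]
        if t + 1 < len(TIERS):
            d_bits = TIERS[t + 1][0] - TIERS[t][0]
            gain = utilities[i] * (TIERS[t][1] - TIERS[t + 1][1])
            heapq.heappush(heap, (-gain / d_bits, i))

    for i in range(len(utilities)):
        push_upgrade(i)
    while heap:
        _, i = heapq.heappop(heap)
        d_bits = TIERS[tiers[i] + 1][0] - TIERS[tiers[i]][0]
        if spent + d_bits > budget_bits:
            continue  # this upgrade does not fit; a cheaper one still might
        spent += d_bits
        tiers[i] += 1
        push_upgrade(i)
    return tiers


def pages_by_tier(tiers):
    """Group kept entries by tier so each page holds a single precision."""
    pages = {}
    for i, t in enumerate(tiers):
        if t > 0:  # tier 0 entries are dropped and occupy no page
            pages.setdefault(t, []).append(i)
    return pages
```

For example, `allocate_tiers([0.9, 0.1, 0.5, 0.05], budget_bits=10)` returns `[2, 1, 2, 0]`: the two highest-utility entries get 4 bits each, the third gets 2 bits, and the lowest-utility entry is dropped. Grouping with `pages_by_tier` then mirrors the tier-homogeneous page layout the abstract describes.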