Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, RoPE rotates queries according to their position, so very few queries are representative, which leads to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions, a phenomenon we call Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention, which estimates key importance by leveraging these centers. Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and we also leverage Q/K norms as an additional signal for importance estimation. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long contexts would otherwise cause out-of-memory errors with Full Attention.
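To make the link between fixed pre-RoPE centers and a distance preference concrete, the following minimal sketch uses only the standard RoPE identity. It is not the TriAttention scoring rule, which the abstract does not specify. If a query and a key both sit exactly at their centers, the post-RoPE logit depends only on the distance between their positions and reduces to a sum of cosines, one per rotation plane. The function names, the toy center values, the head dimension of 8, the base of 10000, and the representation of each rotation plane as a single complex number are all assumptions made for illustration.

```python
import cmath
import math


def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE angular frequencies, one per 2-D rotation plane."""
    return [base ** (-2.0 * j / head_dim) for j in range(head_dim // 2)]


def rotate(vec, pos, freqs):
    """Apply RoPE to a pre-RoPE vector stored as one complex number per plane."""
    return [z * cmath.exp(1j * pos * w) for z, w in zip(vec, freqs)]


def rope_logit(q, k, q_pos, k_pos, freqs):
    """Post-RoPE dot product between a query at q_pos and a key at k_pos."""
    q_rot = rotate(q, q_pos, freqs)
    k_rot = rotate(k, k_pos, freqs)
    # The real 2-D dot product of each plane equals Re(a * conj(b)).
    return sum((a * b.conjugate()).real for a, b in zip(q_rot, k_rot))


def distance_preference(q_center, k_center, distance, freqs):
    """Cosine series in distance = q_pos - k_pos implied by fixed centers.

    Each plane contributes |q_c| * |k_c| * cos(distance * w + phase),
    where phase is the angle between the two centers in that plane.
    """
    total = 0.0
    for qc, kc, w in zip(q_center, k_center, freqs):
        amplitude = abs(qc) * abs(kc)
        phase = cmath.phase(qc) - cmath.phase(kc)
        total += amplitude * math.cos(distance * w + phase)
    return total


if __name__ == "__main__":
    freqs = rope_frequencies(8)
    # Hypothetical centers, chosen only for illustration.
    q_center = [1 + 0.5j, 0.3 - 0.2j, -0.4 + 0.1j, 0.8 + 0.0j]
    k_center = [0.9 + 0.2j, -0.1 + 0.6j, 0.5 + 0.5j, 0.7 - 0.3j]
    for d in (0, 1, 10, 100, 1000):
        print(d, distance_preference(q_center, k_center, d, freqs))
```

Under these assumptions the centers alone fix the amplitudes and phases of the series, so a key's position is enough to evaluate how strongly a centered query at a given position would favor it. Real Q and K vectors only concentrate around their centers, so this series approximates the actual logits rather than reproducing them.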