Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response -- permanently evicting low-importance tokens -- is catastrophic for reasoning: accuracy collapses to 0-2.5% when half the cache is removed. We ask a different question: must every token live in HBM, or can some live elsewhere? We introduce a semantics-aware memory hierarchy that sorts tokens into four tiers -- HBM, DDR, compressed, and evicted -- using cumulative attention scoring. Low-importance tokens are moved to CPU memory rather than destroyed; before each attention step they are prefetched back at full precision, contributing exactly the same terms as if they had never left the GPU. We formalize this as zero-approximation-error offloading and derive our central finding: accuracy depends solely on how many tokens are permanently discarded (the eviction ratio), not on how many remain in HBM. A controlled 3x3 grid over HBM and eviction ratios confirms this across three model scales (7B-32B) and four benchmarks. With only 3% eviction, the hierarchy retains 91% of full-cache accuracy on GSM8K and 71% on MATH-500 (n=200); at 14B scale it matches the uncompressed baseline (90% vs. 86%) while halving HBM occupancy. A head-to-head reproduction of R-KV -- the current state-of-the-art eviction method -- on our setup achieves only 0-32% accuracy at comparable budgets. A system prototype with real GPU-CPU data movement shows that the price of this preservation is modest -- 5-7% transfer overhead -- and scaling analysis projects 2-48 GB of HBM savings at production batch sizes.
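The central claim, that offloaded tokens contribute exactly the same attention terms and that only the eviction ratio changes the result, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. The function names (`assign_tiers`, `attend_with_hierarchy`), the single-head attention, the random stand-in for cumulative attention scores, and the omission of the compressed tier are all assumptions made for illustration. The "CPU" copy is an ordinary array copy standing in for a real GPU-CPU transfer.

```python
import numpy as np

def attention(q, K, V):
    """Single-query softmax attention over all rows of K and V."""
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def assign_tiers(cum_attn, hbm_ratio, evict_ratio):
    """Rank tokens by cumulative attention (hypothetical scoring input).

    Highest-scoring tokens stay in HBM, lowest-scoring tokens are evicted,
    and everything in between is offloaded to DDR (CPU memory).
    """
    n = cum_attn.shape[0]
    order = np.argsort(-cum_attn)            # most important first
    n_hbm = int(round(hbm_ratio * n))
    n_evict = int(round(evict_ratio * n))
    hbm = np.sort(order[:n_hbm])
    ddr = np.sort(order[n_hbm:n - n_evict])
    evicted = np.sort(order[n - n_evict:])
    return hbm, ddr, evicted

def attend_with_hierarchy(q, K, V, cum_attn, hbm_ratio, evict_ratio):
    """Attention where DDR tokens are prefetched back and evicted tokens are lost."""
    hbm, ddr, _ = assign_tiers(cum_attn, hbm_ratio, evict_ratio)
    # Stand-in for offload + prefetch: a full-precision copy of the DDR rows.
    K_cpu, V_cpu = K[ddr].copy(), V[ddr].copy()
    # Reassemble surviving tokens in their original positional order.
    perm = np.argsort(np.concatenate([hbm, ddr]))
    K_re = np.concatenate([K[hbm], K_cpu])[perm]
    V_re = np.concatenate([V[hbm], V_cpu])[perm]
    return attention(q, K_re, V_re)

rng = np.random.default_rng(0)
n_tokens, d = 64, 16
K = rng.standard_normal((n_tokens, d))
V = rng.standard_normal((n_tokens, d))
q = rng.standard_normal(d)
cum_attn = rng.random(n_tokens)              # assumed importance scores

full = attention(q, K, V)
no_evict = attend_with_hierarchy(q, K, V, cum_attn, hbm_ratio=0.5, evict_ratio=0.0)
small_hbm = attend_with_hierarchy(q, K, V, cum_attn, hbm_ratio=0.25, evict_ratio=0.25)
large_hbm = attend_with_hierarchy(q, K, V, cum_attn, hbm_ratio=0.50, evict_ratio=0.25)
half_evict = attend_with_hierarchy(q, K, V, cum_attn, hbm_ratio=0.5, evict_ratio=0.5)

print("offload only == full cache:", np.allclose(no_evict, full))
print("same eviction, different HBM share:", np.allclose(small_hbm, large_hbm))
print("50% eviction == full cache:", np.allclose(half_evict, full))
```

In this toy setting the output is unchanged when tokens are only offloaded, and two configurations with the same eviction ratio but different HBM shares agree, because the set of surviving tokens is identical. Only permanent eviction alters the attention result.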