Large Language Models (LLMs) are increasingly deployed in scenarios demanding ultra-long context reasoning, such as agentic workflows and deep research understanding. However, long-context inference is constrained by the KV cache, a transient memory structure that grows linearly with sequence length and batch size and quickly dominates GPU memory usage. Existing memory reduction techniques, including eviction and quantization, often rely on static heuristics and suffer degraded quality under tight budgets. In this paper, we propose ARKV, a lightweight and adaptive framework that dynamically allocates precision levels to cached tokens based on per-layer attention dynamics and token-level importance. During a short prefill phase, ARKV estimates the Original-Quantization (OQ) ratio of each layer by computing statistical scores such as attention entropy, variance, and kurtosis. During decoding, each token is assigned to one of three states: Original (full precision), Quantization (low precision), or Eviction, according to a fast heavy-hitter scoring strategy. Our experiments on LLaMA3 and Qwen3 models across diverse long- and short-context tasks demonstrate that ARKV preserves ~97% of baseline accuracy on long-context benchmarks while reducing KV memory usage by 4x, with minimal throughput loss. On short-context tasks, ARKV matches full-precision baselines; on GSM8K math reasoning, it significantly outperforms uniform quantization. These results highlight the practical viability of ARKV for scalable LLM deployment, offering fine-grained, data-driven memory control without retraining or architectural modifications. The source code and artifacts are available at https://github.com/Large-scale-Sustainable-Computing-LSC/ARKV
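The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the ARKV implementation: the function names, the use of accumulated attention mass as the heavy-hitter score, and the treatment of the OQ ratio as the fraction of retained tokens kept in full precision are all assumptions made for illustration. The abstract does not say how the layer statistics are mapped to the OQ ratio, so the sketch computes the statistics and takes the ratio as an input.

```python
import math


def attention_stats(probs):
    """Entropy, variance, and excess kurtosis of one attention distribution."""
    n = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    mean = sum(probs) / n
    var = sum((p - mean) ** 2 for p in probs) / n
    if var == 0.0:
        kurt = 0.0  # uniform weights: define kurtosis as 0 to avoid 0/0
    else:
        kurt = sum((p - mean) ** 4 for p in probs) / (n * var ** 2) - 3.0
    return entropy, var, kurt


def heavy_hitter_scores(attn_rows):
    """Assumed score: total attention each cached token received over all queries."""
    scores = [0.0] * len(attn_rows[0])
    for row in attn_rows:
        for j, p in enumerate(row):
            scores[j] += p
    return scores


def assign_states(scores, budget, oq_ratio):
    """Label each cached token 'O' (original), 'Q' (quantized), or 'E' (evicted).

    budget   -- number of tokens kept in the cache (hypothetical parameter)
    oq_ratio -- assumed fraction of kept tokens stored in full precision
    """
    budget = min(budget, len(scores))
    n_orig = int(budget * oq_ratio)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    states = ["E"] * len(scores)
    for rank, idx in enumerate(ranked[:budget]):
        states[idx] = "O" if rank < n_orig else "Q"
    return states


# Three query rows attending over four cached tokens.
attn = [
    [0.70, 0.15, 0.10, 0.05],
    [0.60, 0.20, 0.15, 0.05],
    [0.50, 0.30, 0.15, 0.05],
]
states = assign_states(heavy_hitter_scores(attn), budget=2, oq_ratio=0.5)
```

In this toy input, the first token receives the most attention and stays in full precision, the second is kept but quantized, and the remaining two are evicted.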