Long-context decoding in large language models (LLMs) is constrained by the cost of accessing and processing the Key-Value (KV) cache. Despite evidence that attention outputs depend jointly on keys and values, most existing KV management methods rely on key-only pruning, since incorporating values incurs prohibitive overhead. In this paper, we propose Attention Run-time Termination (ART), a lightweight run-time mechanism that tracks the accumulated attention output during kernel execution and terminates subsequent KV block accesses once further contributions become negligible. Rather than replacing KV selection, ART dynamically terminates redundant KV traversal on top of existing dense or sparse attention policies. We introduce a stability-based criterion that monitors both magnitude and directional changes of the intermediate attention output, and we provide a theoretical characterization of the resulting truncation error. Experiments on the LongBench and RULER Needle-in-a-Haystack tasks show that ART increases the generation throughput of existing KV-cache methods by up to 20% without compromising output quality.
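The run-time termination idea can be illustrated with a minimal NumPy sketch for a single decoding query. It is not the paper's kernel: the function name `art_decode_attention`, the thresholds `eps_mag` and `eps_dir`, the `patience` counter, and the exact form of the stability test (relative norm change plus cosine distance between consecutive intermediate outputs) are all assumptions made for illustration. The sketch accumulates attention block by block with an online softmax and stops reading further KV blocks once the intermediate output has stopped changing in both magnitude and direction.

```python
import numpy as np


def art_decode_attention(q, K, V, block_size=64,
                         eps_mag=1e-3, eps_dir=1e-3, patience=2):
    """Block-wise single-query attention with a hypothetical early exit.

    q: (d,), K: (n, d), V: (n, d_v). Blocks are visited in the order given;
    an upstream dense or sparse policy is assumed to have chosen that order.
    Returns (output, number_of_blocks_visited).
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    run_max = -np.inf                 # running max logit (online softmax)
    denom = 0.0                       # running softmax denominator
    acc = np.zeros(V.shape[-1])       # running unnormalized weighted value sum
    prev_out = None
    stable = 0                        # consecutive blocks judged "stable"
    visited = 0

    for start in range(0, K.shape[0], block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]

        # Online-softmax update: rescale old state to the new running max.
        logits = (k_blk @ q) * scale
        new_max = max(run_max, logits.max())
        correction = np.exp(run_max - new_max)
        probs = np.exp(logits - new_max)
        denom = denom * correction + probs.sum()
        acc = acc * correction + probs @ v_blk
        run_max = new_max
        visited += 1

        out = acc / denom             # intermediate attention output so far
        if prev_out is not None:
            n_out = np.linalg.norm(out)
            n_prev = np.linalg.norm(prev_out)
            # Assumed criterion: small relative magnitude change AND
            # small directional change (1 - cosine similarity).
            rel_mag = abs(n_out - n_prev) / (n_prev + 1e-12)
            cos = float(out @ prev_out) / (n_out * n_prev + 1e-12)
            if rel_mag < eps_mag and (1.0 - cos) < eps_dir:
                stable += 1
            else:
                stable = 0
            if stable >= patience:
                break                 # skip the remaining KV blocks
        prev_out = out

    return acc / denom, visited
```

Setting both thresholds to zero disables the early exit, so the function then reduces to ordinary softmax attention over all blocks. Because the test is applied to the normalized output, it reflects keys and values jointly, which is the property the abstract contrasts with key-only pruning.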