Trainable sparse attention has emerged as a promising solution to the decoding efficiency bottleneck of LLMs in long-context processing, substantially reducing memory accesses while minimally impacting task performance. However, existing sparse attention methods leave a crucial limitation unresolved: the size of the key-value (KV) cache is not reduced, which constrains on-GPU batch sizes and throttles decoding throughput, especially in large-scale batched inference. In this paper, we show that trainable sparse attention naturally exhibits strong locality in token selection across adjacent decoding steps, enabling KV cache offloading without altering the underlying attention computation. This inherent locality alone, however, is insufficient for efficient offloading, as the transfer of selected KV pairs between the CPU and GPU still dominates the overall decoding cost. Building on this insight, we present NOSA, a trainable sparse attention framework designed to natively support KV cache offloading. NOSA introduces explicit locality constraints by decomposing token selection into query-aware and query-agnostic components, thereby reducing KV transfers while preserving the same attention computation used during training. We pretrain a 1B-parameter model with NOSA and conduct extensive benchmarks, showing that it preserves near-lossless performance while achieving up to a 2.3x improvement in decoding throughput over the vanilla trainable sparse attention baseline (InfLLM-V2).
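To make the abstract's core idea concrete, the following is a minimal, illustrative Python sketch (not the paper's implementation) of decomposing token selection into a query-aware and a query-agnostic component, so that adjacent decoding steps reuse most of the selected KV blocks and only a small residual must be transferred from CPU to GPU under offloading. All names (`select_blocks`, `n_query_aware`, `n_query_agnostic`) are hypothetical.

```python
# Hypothetical sketch of locality-constrained block selection for offloaded KV caches.
import numpy as np

def select_blocks(query, block_keys, n_query_aware, n_query_agnostic):
    """Return indices of KV blocks to fetch for this decoding step.

    block_keys: (num_blocks, d) representative key per KV block.
    The query-agnostic component depends only on the context (here, simply
    the most recent blocks), so it is identical across adjacent decoding
    steps and its KV pairs can stay resident on the GPU. Only the
    query-aware component may change between steps and trigger transfers.
    """
    num_blocks = block_keys.shape[0]
    # Query-agnostic component: e.g., always keep the most recent blocks.
    agnostic = np.arange(max(0, num_blocks - n_query_agnostic), num_blocks)
    # Query-aware component: top-k blocks by similarity to the current query,
    # chosen among the remaining (possibly CPU-resident) blocks.
    scores = block_keys @ query
    scores[agnostic] = -np.inf            # avoid double-selecting
    aware = np.argsort(-scores)[:n_query_aware]
    return np.concatenate([agnostic, aware])

# Usage: only blocks newly selected relative to the previous step need to be
# copied from CPU memory to the GPU cache.
rng = np.random.default_rng(0)
d, num_blocks = 64, 256
block_keys = rng.standard_normal((num_blocks, d))
prev = select_blocks(rng.standard_normal(d), block_keys, 8, 8)
curr = select_blocks(rng.standard_normal(d), block_keys, 8, 8)
transfers = np.setdiff1d(curr, prev)      # blocks to move CPU -> GPU
print(len(transfers), "blocks transferred this step")
```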