The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at specific layers; as a result, they can retain irrelevant tokens or rely on irreversible early decisions despite the layer- and head-wise dynamics of token importance. In this paper, we propose Token Sparse Attention, a lightweight and dynamic token-level sparsification mechanism that compresses per-head $Q$, $K$, $V$ to a reduced token set during attention and then decompresses the output back to the original sequence, enabling token information to be reconsidered in subsequent layers. Furthermore, Token Sparse Attention exposes a new design point at the intersection of token selection and sparse attention. Our approach is fully compatible with dense attention implementations, including FlashAttention, and can be seamlessly composed with existing sparse attention kernels. Experimental results show that Token Sparse Attention consistently improves the accuracy-latency trade-off, achieving up to a 3.23$\times$ attention speedup at 128K context with less than 1% accuracy degradation. These results demonstrate that dynamic, interleaved token-level sparsification is a complementary and effective strategy for scalable long-context inference.
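The compress-attend-decompress flow described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the importance score (here, the L2 norm of each key), the keep ratio, and the pass-through behavior for unselected tokens are all illustrative assumptions; the abstract only specifies that $Q$, $K$, $V$ are compressed to a reduced token set per head and the output is scattered back to the full sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_sparse_attention(Q, K, V, keep_ratio=0.5):
    """Illustrative sketch: compress tokens per head, attend densely
    on the reduced set, then decompress back to the full sequence.
    Q, K, V: arrays of shape (heads, seq_len, head_dim)."""
    H, N, d = Q.shape
    k = max(1, int(N * keep_ratio))
    out = np.empty_like(V)
    for h in range(H):
        # Assumed importance proxy (not specified in the abstract):
        # L2 norm of each key vector.
        scores = np.linalg.norm(K[h], axis=-1)
        idx = np.sort(np.argsort(scores)[-k:])      # top-k tokens, original order
        q, kk, v = Q[h][idx], K[h][idx], V[h][idx]  # compress Q, K, V per head
        attn = softmax(q @ kk.T / np.sqrt(d))       # dense attention on reduced set
        out[h][idx] = attn @ v                      # decompress: scatter back
        # One possible choice for unselected tokens: pass V through unchanged,
        # so their information remains available to later layers.
        mask = np.ones(N, dtype=bool)
        mask[idx] = False
        out[h][mask] = V[h][mask]
    return out
```

Because the reduced-set attention is an ordinary dense attention over the kept tokens, the inner computation could be swapped for FlashAttention or any sparse attention kernel without changing the surrounding compress/decompress logic.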