Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its efficiency-accuracy trade-offs remain unclear due to the lack of comprehensive evaluation. We address this gap with the largest-scale empirical analysis to date of training-free sparse attention, evaluating six methods across multiple model families and sizes, sequences up to 128K tokens, and sparsity levels up to 0.95 (i.e., $1/20$ attention budget) on nine diverse tasks. We first organise the rapidly evolving landscape of sparse attention methods into a taxonomy along four design axes. Our analysis then yields actionable insights: 1) sparse attention is effective -- larger sparse models outperform smaller dense ones at equivalent cost, improving the Pareto frontier; 2) token-to-page importance estimation is computationally infeasible during prefilling, where the choice between the alternatives (global-to-token or block-to-block) is task-dependent, but it is feasible during decoding, enabling better generalisation and tolerance to higher sparsity; 3) longer sequences tolerate higher sparsity, suggesting that fixed-budget methods in production are suboptimal. Together, these findings provide practical guidance for deploying sparse attention and methodological recommendations for future evaluations. Our code is available at https://github.com/PiotrNawrot/sparse-frontier.