As large language models scale to longer contexts, loading the growing KV cache during attention computation becomes a critical bottleneck. Previous work has shown that attention computation is dominated by a small subset of tokens. This motivates block sparse attention methods, which partition the KV cache into fixed-size blocks and compute attention only over the blocks with high importance. However, these methods assign a uniform block size across all attention heads, implicitly assuming homogeneous behavior throughout the model. Our analysis reveals that this assumption is flawed: attention heads vary widely in their sensitivity to block granularity, and a uniform block size leads to suboptimal accuracy. We present AB-Sparse, a training-free framework based on algorithm-system co-design that improves accuracy while preserving throughput. AB-Sparse allocates block sizes adaptively across attention heads using a lightweight procedure. To compensate for the additional memory overhead this introduces, it employs lossless block centroid quantization. Custom GPU kernels support efficient execution with variable block sizes. In our evaluation, AB-Sparse improves accuracy by up to 5.43% over existing block sparse attention baselines without throughput overhead.
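To make the block sparse attention idea concrete, the following is a minimal single-query sketch in plain Python. It is not the AB-Sparse implementation: the abstract does not specify how block importance is scored, so this sketch assumes each block is summarized by the mean of its keys and ranked by the dot product between the query and that centroid. The function names (`block_centroids`, `select_blocks`, `block_sparse_attention`) and the `top_k` parameter are hypothetical, chosen for illustration only.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def block_centroids(keys, block_size):
    """Mean key vector of each consecutive block of `block_size` tokens."""
    centroids = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        dim = len(block[0])
        centroids.append(
            [sum(k[d] for k in block) / len(block) for d in range(dim)]
        )
    return centroids


def select_blocks(query, centroids, top_k):
    """Indices of the top_k blocks whose centroids score highest (assumed metric)."""
    scores = [dot(query, c) for c in centroids]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Softmax attention restricted to the tokens of the selected blocks."""
    chosen = select_blocks(query, block_centroids(keys, block_size), top_k)
    token_ids = [
        t
        for b in chosen
        for t in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[t]) * scale for t in token_ids]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
        for d in range(dim)
    ]
    return output, chosen
```

Because `block_size` is an ordinary argument here, a caller could pass a different value for each attention head, which is the degree of freedom the abstract argues uniform methods leave unused. How AB-Sparse actually chooses those per-head sizes is not described in this text and is not modeled by the sketch.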