This work introduces Hybrid Sparse Attention (HySparse), a new architecture that interleaves each full attention layer with several sparse attention layers. While conceptually simple, HySparse strategically derives each sparse layer's token selection and KV cache directly from the preceding full attention layer. This design resolves two fundamental limitations of prior sparse attention methods. First, conventional approaches typically rely on auxiliary proxies to predict token importance, introducing extra complexity and potentially suboptimal performance. In contrast, HySparse uses the full attention layer as a precise oracle to identify important tokens. Second, existing sparse attention designs often reduce computation without reducing KV cache storage. HySparse enables sparse attention layers to reuse the full attention layer's KV cache, thereby cutting both computation and memory. We evaluate HySparse on both a 7B dense model and an 80B MoE model. Across all settings, HySparse consistently outperforms both full attention and hybrid sliding-window attention (SWA) baselines. Notably, in the 80B MoE model with 49 total layers, only 5 layers employ full attention, yet HySparse achieves substantial performance gains while reducing KV cache storage by nearly 10x.
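The core mechanism described above can be sketched as follows. This is a minimal single-head NumPy illustration of the idea, not the actual HySparse implementation: the function names, the top-k selection rule, and the omission of causal masking and multi-head structure are all simplifying assumptions. It shows a full attention layer exposing its attention scores (the "oracle") and its KV cache, and a subsequent sparse layer that attends only to the highest-scoring tokens while reusing that cache instead of storing its own.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(q, k, v):
    """Full attention layer: returns the output plus the score matrix and
    KV cache that downstream sparse layers reuse (illustrative sketch)."""
    d = q.shape[-1]
    scores = softmax(q @ k.T / np.sqrt(d))  # (T, T) attention weights
    return scores @ v, scores, (k, v)

def sparse_attention(q, kv_cache, scores, top_k):
    """Sparse layer: attends only to the top_k tokens per query, ranked by
    the preceding full layer's attention scores (used as an importance
    oracle), and reuses that layer's KV cache rather than storing its own."""
    k, v = kv_cache
    d = q.shape[-1]
    out = np.zeros_like(q)
    for i in range(q.shape[0]):
        idx = np.argsort(scores[i])[-top_k:]       # most important tokens
        w = softmax(q[i] @ k[idx].T / np.sqrt(d))  # attend to selected subset
        out[i] = w @ v[idx]
    return out

# Toy usage: one full attention layer followed by one sparse layer.
T, d = 8, 4
rng = np.random.default_rng(0)
q1, k1, v1 = rng.normal(size=(3, T, d))
h1, scores, cache = full_attention(q1, k1, v1)
q2 = rng.normal(size=(T, d))                       # next layer's queries
h2 = sparse_attention(q2, cache, scores, top_k=3)  # no new KV cache stored
```

In this toy setting the sparse layer's per-token cost scales with `top_k` rather than the sequence length `T`, and no additional K/V tensors are materialized beyond the full layer's cache, mirroring the compute and memory savings the abstract claims.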