Sparse attention reduces the quadratic complexity of full self-attention but faces two challenges: (1) an attention gap, where applying sparse attention to full-attention-trained models causes performance degradation due to train-inference distribution mismatch, and (2) a capability gap, where models trained purely with sparse attention lack complete gradient flow, preventing them from matching full-attention performance. We propose SSA (Sparse Sparse Attention), a training framework that integrates both sparse and full attention with bidirectional attention-output alignment. We prove that the approximation error scales linearly with the attention mass dropped under sparse attention, and show that SSA's alignment objective substantially reduces this quantity compared to baselines. Experiments demonstrate that SSA achieves state-of-the-art performance under both inference modes, adapts smoothly to varying sparsity budgets, and exhibits superior long-context capabilities. The code is available at https://github.com/zhenyi4/ssa.
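The attention-output alignment described above can be sketched in miniature: compute the output of full attention and of a sparsified attention, then penalize the discrepancy between them. Everything below is an illustrative assumption, not the paper's actual objective: the top-k sparsity pattern, the symmetric MSE loss, and all function names are hypothetical stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_output(q, k, v, mask=None):
    # Scaled dot-product attention; masked-out positions get -inf before softmax.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores, axis=-1) @ v

def topk_mask(q, k, budget):
    # Keep only the `budget` highest-scoring keys per query
    # (one simple sparsity pattern; the real method may differ).
    scores = q @ k.T
    idx = np.argsort(scores, axis=-1)[:, -budget:]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def alignment_loss(q, k, v, budget):
    # Symmetric MSE between full and sparse outputs: its gradient pushes
    # both branches toward each other ("bidirectional" alignment).
    full = attention_output(q, k, v)
    sparse = attention_output(q, k, v, mask=topk_mask(q, k, budget))
    return float(np.mean((full - sparse) ** 2))
```

As the sparsity budget grows the sparse branch retains more attention mass, so the alignment loss shrinks, consistent with the linear dependence on dropped attention mass stated in the abstract; at a full budget the two outputs coincide and the loss is zero.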