SEA: Sparse Linear Attention with Estimated Attention Mask

The transformer architecture has driven breakthroughs in recent years on tasks that require modeling pairwise relationships between sequential elements, as is the case in natural language understanding. However, transformers struggle with long sequences due to the quadratic complexity of the attention operation, and previous research has aimed to lower this complexity by sparsifying or linearly approximating the attention matrix. Yet these approaches cannot straightforwardly distill knowledge from a teacher's attention matrix, and they often require complete retraining from scratch. Furthermore, previous sparse and linear approaches lose interpretability if they cannot produce full quadratic attention matrices. To address these challenges, we propose SEA: Sparse linear attention with an Estimated Attention mask. SEA estimates the attention matrix with linear complexity via kernel-based linear attention, then creates a sparse approximation to the full attention matrix with a top-k selection and uses it to perform a sparse attention operation. On language modeling (Wikitext2), previous linear and sparse attention methods yield perplexity scores roughly twice as bad as the quadratic OPT-125M baseline, while SEA achieves better perplexity than OPT-125M using roughly half as much memory. Moreover, SEA maintains an interpretable attention matrix and can use knowledge distillation to lower the complexity of existing pretrained transformers. We believe that our work will have a large practical impact, as it opens the possibility of running large transformers on resource-limited devices with less memory.
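The estimate, select, attend pipeline described above can be illustrated with a toy NumPy sketch. This is not the SEA implementation. The function names, the `elu(x) + 1` feature map, and the non-causal single-head setting are all assumptions made for illustration. For readability, the sketch also materializes the estimated score matrix and the masked logits densely, so it runs in quadratic time and memory, whereas SEA is designed to keep the estimation step linear and the attention step sparse.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1: a positive feature map common in kernel-based linear
    # attention. Assumption for illustration; SEA's kernel may differ.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def estimated_scores(q, k):
    # Kernel estimate of the attention matrix: phi(q) phi(k)^T, row-normalized.
    # Materialized densely here for clarity only (quadratic cost).
    s = feature_map(q) @ feature_map(k).T
    return s / s.sum(axis=-1, keepdims=True)


def topk_mask(scores, top_k):
    # Boolean mask keeping the top_k largest estimated scores in each row.
    idx = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask


def sparse_attention(q, k, v, mask):
    # Softmax attention restricted to the masked positions. A real sparse
    # kernel would compute only the selected entries instead of masking.
    logits = (q @ k.T) / np.sqrt(q.shape[-1])
    logits = np.where(mask, logits, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
mask = topk_mask(estimated_scores(q, k), top_k=4)
out, weights = sparse_attention(q, k, v, mask)
print(out.shape, int((weights > 0).sum(axis=-1).max()))
```

The returned `weights` matrix has the full sequence-by-sequence shape with only `top_k` nonzero entries per row, which is the sense in which a sparse attention matrix of this kind remains inspectable.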
