Attention mechanisms play a crucial role in the neural revolution of Natural Language Processing (NLP). As attention-based models have grown, several pruning techniques have been developed to identify and exploit sparseness and make these models more efficient. Most efforts focus on hard-coding attention patterns or pruning attention weights based on training data. We propose Attention Pruning (AP), a framework that observes attention patterns in a fixed dataset and generates a global sparseness mask. AP saves 90% of attention computation for language modeling and about 50% for machine translation and GLUE tasks while maintaining result quality. Our method reveals important distinctions between self- and cross-attention patterns, which can guide future NLP research. Our framework can reduce both latency and memory requirements for any attention-based model, aiding the development of improved models for existing or new NLP applications. We demonstrate this with encoder and autoregressive transformer models using Triton GPU kernels, and we make our code publicly available at https://github.com/irugina/AP.
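The idea of observing attention patterns on a fixed dataset and deriving one global mask can be illustrated with a minimal NumPy sketch. This is not the code from the repository above: the function names (`average_attention`, `global_mask`, `masked_softmax`), the per-head top-k thresholding rule, and the `sparsity` parameter are all assumptions made for illustration, and the actual AP procedure may differ.

```python
import numpy as np


def average_attention(attn_maps):
    """Average attention maps of shape (heads, T, T) over a fixed dataset.

    Hypothetical helper: assumes every sample yields maps of the same shape.
    """
    total, count = None, 0
    for attn in attn_maps:
        attn = np.asarray(attn, dtype=np.float64)
        total = attn.copy() if total is None else total + attn
        count += 1
    if count == 0:
        raise ValueError("need at least one attention map")
    return total / count


def global_mask(avg_attn, sparsity):
    """Boolean mask keeping the largest (1 - sparsity) fraction of entries per head.

    Assumed rule: top-k by average attention weight, applied to each head.
    """
    heads, rows, cols = avg_attn.shape
    keep = max(1, int(round((1.0 - sparsity) * rows * cols)))
    flat = avg_attn.reshape(heads, -1)
    # Indices of the `keep` largest entries in each head (order not needed).
    top = np.argpartition(flat, -keep, axis=1)[:, -keep:]
    mask = np.zeros(flat.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask.reshape(heads, rows, cols)


def masked_softmax(scores, mask):
    """Softmax over the last axis that ignores masked-out positions.

    Rows with no kept position produce all zeros instead of NaN.
    """
    masked = np.where(mask, np.asarray(scores, dtype=np.float64), -np.inf)
    row_max = masked.max(axis=-1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    weights = np.exp(masked - row_max)  # exp(-inf) evaluates to 0
    denom = weights.sum(axis=-1, keepdims=True)
    return np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)
```

With `sparsity=0.9`, the mask keeps roughly 10% of the entries in each head. A dense NumPy implementation like this one still computes every score, so it only shows how the mask is built and applied; actual latency and memory savings depend on sparse kernels that skip the masked positions, such as the Triton GPU kernels mentioned above.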