Sparsifying the Transformer has attracted considerable interest because training the Transformer is computationally demanding. Prior efforts have used either a fixed pattern or a data-driven approach to reduce the number of operations in multi-head attention, which is the main bottleneck of the Transformer. However, existing methods have notable drawbacks: a uniform fixed pattern applied across all layers can lose essential sequence features, and learning sparsity patterns in attention operations with additional parameters increases the model size. In this paper, we propose a novel sparsification scheme for the Transformer that integrates convolution filters and the flood filling method to efficiently capture the layer-wise sparse pattern in attention operations. Our sparsification approach reduces the computational complexity and memory footprint of the Transformer during training. We develop efficient GPU implementations of the layer-wise sparsified attention algorithm. The resulting model, SPION, achieves up to 3.08X speedup over existing state-of-the-art sparse Transformer models, with better evaluation quality.
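To make the idea of combining a convolution filter with flood filling concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the filter shape, the seed positions, the threshold, or the granularity of the pattern, so everything here is an assumption chosen for illustration: a 3x3 diagonal averaging kernel, seeds on the main diagonal, 8-connected growth, and a fixed threshold. The function names (`convolve2d_same`, `flood_fill_mask`, `layer_sparsity_mask`) are hypothetical, and the code runs densely on the CPU in pure Python rather than on a GPU.

```python
from collections import deque


def convolve2d_same(matrix, kernel):
    """Zero-padded 'same' 2D cross-correlation (assumed square, odd-sized kernel)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    r, c = i + di - pad, j + dj - pad
                    if 0 <= r < n_rows and 0 <= c < n_cols:
                        acc += matrix[r][c] * kernel[di][dj]
            out[i][j] = acc
    return out


def flood_fill_mask(scores, seeds, threshold):
    """Grow a boolean mask from the seeds through 8-connected cells
    whose score is at least `threshold`. Seeds are always kept."""
    n_rows, n_cols = len(scores), len(scores[0])
    mask = [[False] * n_cols for _ in range(n_rows)]
    queue = deque()
    for r, c in seeds:
        if not mask[r][c]:
            mask[r][c] = True
            queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < n_rows and 0 <= nc < n_cols
                        and not mask[nr][nc]
                        and scores[nr][nc] >= threshold):
                    mask[nr][nc] = True
                    queue.append((nr, nc))
    return mask


def layer_sparsity_mask(attn, kernel, threshold):
    """Filter one layer's attention scores, then flood fill from the diagonal.
    Diagonal seeding is an assumption of this sketch, not taken from the paper."""
    filtered = convolve2d_same(attn, kernel)
    seeds = [(i, i) for i in range(min(len(attn), len(attn[0])))]
    return flood_fill_mask(filtered, seeds, threshold)


# Toy 5x5 attention matrix: a strong diagonal band plus one isolated entry.
n = 5
attn = [[0.5 if i == j else 0.25 if abs(i - j) == 1 else 0.0
         for j in range(n)] for i in range(n)]
attn[0][4] = 0.9  # large, but not connected to the diagonal band

# Assumed kernel: averages along the main-diagonal direction.
diag_kernel = [[1 / 3, 0.0, 0.0],
               [0.0, 1 / 3, 0.0],
               [0.0, 0.0, 1 / 3]]

mask = layer_sparsity_mask(attn, diag_kernel, threshold=0.1)
for row in mask:
    print("".join("#" if kept else "." for kept in row))
```

In this toy case the mask covers only the tridiagonal band. The isolated entry at position (0, 4) passes the threshold after filtering but is never reached, because flood filling keeps only cells connected to a seed. Plain thresholding would have kept it. Computing a separate mask from each layer's attention scores is one way to obtain a layer-wise pattern without adding learned parameters.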