Recent neural networks (NNs) with self-attention are competitive across diverse AI domains, but the underlying attention mechanism incurs massive computation and memory demands. To mitigate this, various sparsity patterns have been introduced to reduce the quadratic computational complexity; among them, structured butterfly sparsity has proven effective at reducing computation while maintaining model accuracy. However, its complicated data-access pattern degrades utilization and makes parallelism hard to exploit on general block-oriented architectures such as GPUs. Since reconfigurable dataflow architectures are known for better data reusability and architectural flexibility in general NN acceleration, we apply them to butterfly sparsity to achieve higher computational efficiency on attention workloads. We first propose a hybrid butterfly-sparsity network that achieves a better trade-off between attention accuracy and performance. We then propose a scalable multilayer dataflow method, supported by coarse-grained streaming-parallelism designs, to orchestrate butterfly-sparsity computation on the dataflow array. Experiments show that, compared with the Jetson Xavier NX, our design achieves a speedup of up to $14.34\times$ ($9.29\times$ on average) and an $11.14\times$ energy-efficiency improvement on attention workloads. Compared with state-of-the-art attention accelerators of the same peak performance, our dataflow architecture attains $2.38\times$-$4.7\times$ efficiency improvement and $6.60\times$-$15.37\times$ energy reduction with butterfly-sparsity optimization.
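To make the complexity claim concrete, the sketch below (illustrative only, not the paper's implementation; the function `butterfly_multiply` and the `twiddles` layout are our assumptions) multiplies a vector by a butterfly matrix in $O(n \log n)$ time using $\log_2 n$ stages of $2{\times}2$ mixing blocks. The access stride that doubles every stage is exactly the irregular pattern the abstract says block-oriented architectures handle poorly.

```python
import numpy as np

def butterfly_multiply(x, twiddles):
    """Multiply a length-n vector (n a power of two) by a butterfly
    matrix stored as log2(n) stages of 2x2 mixing blocks.

    twiddles[s] has shape (n // 2, 2, 2): one 2x2 block per index
    pair at stage s, so the whole matrix takes O(n log n) parameters
    and work, versus O(n^2) for a dense matrix.
    """
    n = x.shape[0]
    y = x.copy()
    stride = 1
    for stage in twiddles:                    # log2(n) stages
        out = np.empty_like(y)
        pair = 0
        for start in range(0, n, 2 * stride):
            for j in range(stride):
                # The stride doubles each stage: this stage-dependent
                # strided access is the irregular pattern that hurts
                # GPU utilization.
                i0, i1 = start + j, start + j + stride
                t = stage[pair]               # 2x2 mixing block
                out[i0] = t[0, 0] * y[i0] + t[0, 1] * y[i1]
                out[i1] = t[1, 0] * y[i0] + t[1, 1] * y[i1]
                pair += 1
        y = out
        stride *= 2
    return y

# n = 8 needs 3 stages of 4 blocks: 3 * 4 * 4 = 48 parameters
# instead of 64 dense entries; at n = 1024 it is ~20K vs ~1M.
rng = np.random.default_rng(0)
n = 8
twiddles = [rng.standard_normal((n // 2, 2, 2)) for _ in range(3)]
print(butterfly_multiply(rng.standard_normal(n), twiddles))
```

The cost is $\log_2 n$ stages of $n/2$ blocks with four multiplies each, i.e. $2n\log_2 n$ operations instead of $n^2$, which is the computation reduction the butterfly sparsity pattern provides.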