AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference with Transformers

Self-attention-based transformer models have achieved tremendous success in natural language processing. Despite their efficacy, transformers are challenging to accelerate because of their quadratic computational complexity and large activation sizes. Existing transformer accelerators attempt to prune tokens to reduce memory access, but incur high compute overheads in doing so. Moreover, previous works operate directly on the large matrices involved in the attention operation, which limits hardware utilization. To address these challenges, this work proposes a novel dynamic inference scheme, DynaTran, which prunes activations at runtime with low overhead, substantially reducing the number of ineffectual operations and thereby improving the throughput of transformer inference. We further propose tiling the matrices in transformer operations and using diverse dataflows to improve data reuse, thus enabling higher energy efficiency. To implement these methods effectively, we propose AccelTran, a novel accelerator architecture for transformers. Extensive experiments with different models and benchmarks demonstrate that DynaTran achieves higher accuracy than the state-of-the-art top-k hardware-aware pruning strategy while attaining up to 1.2$\times$ higher sparsity. One of our proposed accelerators, AccelTran-Edge, achieves 330K$\times$ higher throughput with 93K$\times$ lower energy requirement than a Raspberry Pi device. AccelTran-Server, in turn, achieves 5.73$\times$ higher throughput and 3.69$\times$ lower energy consumption than the state-of-the-art transformer co-processor, Energon. The simulation source code is available at https://github.com/jha-lab/acceltran.
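
To make the two ideas above concrete, the following is a minimal software sketch, not the AccelTran implementation. It assumes that runtime activation pruning amounts to zeroing entries whose magnitude falls below a threshold (the name `tau` and all function names here are hypothetical), contrasts this with a per-row top-k baseline, and shows how a tiled matrix multiplication can skip tiles that have become all-zero. The abstract does not specify these details, so the sketch should be read as one plausible illustration under those assumptions.

```python
def prune_by_threshold(matrix, tau):
    """Zero entries with magnitude below tau (assumed pruning rule).

    Needs only one comparison per element, with no sorting.
    """
    return [[x if abs(x) >= tau else 0.0 for x in row] for row in matrix]


def prune_top_k(matrix, k):
    """Baseline: keep the k largest-magnitude entries per row (needs a sort)."""
    pruned = []
    for row in matrix:
        order = sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)
        keep = set(order[:k])
        pruned.append([x if j in keep else 0.0 for j, x in enumerate(row)])
    return pruned


def sparsity(matrix):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for x in row if x == 0.0)
    return zeros / total


def tiled_matmul(a, b, tile):
    """Compute a @ b tile by tile, skipping all-zero tiles of a.

    Returns the product and the number of tile-level products executed,
    a rough software proxy for the work a tiled accelerator would perform.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    tiles_run = 0
    for i0 in range(0, n, tile):
        rows = range(i0, min(i0 + tile, n))
        for k0 in range(0, m, tile):
            inner = range(k0, min(k0 + tile, m))
            # An all-zero tile of a contributes nothing, so skip it entirely.
            if all(a[i][k] == 0.0 for i in rows for k in inner):
                continue
            for j0 in range(0, p, tile):
                cols = range(j0, min(j0 + tile, p))
                tiles_run += 1
                for i in rows:
                    for k in inner:
                        for j in cols:
                            c[i][j] += a[i][k] * b[k][j]
    return c, tiles_run


# Illustrative data only; these values are not taken from the paper.
scores = [[0.9, 0.05, -0.02, 0.4],
          [0.01, -0.7, 0.03, 0.02]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]

pruned = prune_by_threshold(scores, tau=0.1)
_, dense_tiles = tiled_matmul(scores, values, tile=1)
_, sparse_tiles = tiled_matmul(pruned, values, tile=1)
print(sparsity(pruned), sparsity(prune_top_k(scores, k=2)))
print(dense_tiles, sparse_tiles)
```

In this toy example the threshold rule adapts to the data, zeroing more entries in rows dominated by one value, whereas top-k always keeps a fixed number per row and must rank the entries first. Fewer non-zero tiles then translate directly into fewer tile-level products in the tiled multiplication.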
