AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference with Transformers

Self-attention-based transformer models have achieved tremendous success in natural language processing. Despite their efficacy, transformers are challenging to accelerate because of their quadratic computational complexity and large activation sizes. Existing transformer accelerators prune tokens to reduce memory access, but at the cost of high compute overhead. Moreover, previous works operate directly on the large matrices involved in the attention operation, which limits hardware utilization. To address these challenges, this work proposes a novel dynamic inference scheme, DynaTran, which prunes activations at runtime with low overhead, substantially reducing the number of ineffectual operations and thereby improving the throughput of transformer inference. We further propose tiling the matrices in transformer operations and applying diverse dataflows to improve data reuse, thus enabling higher energy efficiency. To implement these methods effectively, we propose AccelTran, a novel accelerator architecture for transformers. Extensive experiments with different models and benchmarks demonstrate that DynaTran achieves higher accuracy than the state-of-the-art top-k hardware-aware pruning strategy while attaining up to 1.2$\times$ higher sparsity. One of our proposed accelerators, AccelTran-Edge, achieves 330K$\times$ higher throughput with 93K$\times$ lower energy consumption than a Raspberry Pi device. AccelTran-Server, in turn, achieves 5.73$\times$ higher throughput and 3.69$\times$ lower energy consumption than the state-of-the-art transformer co-processor, Energon.
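To make the contrast between runtime activation pruning and top-k pruning concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state the pruning criterion, so the magnitude threshold used here (the hypothetical parameter `tau`) is an assumption, as are all function names. The sketch only illustrates why a per-element comparison is cheaper than a per-row selection, and how sparsity (the fraction of zeroed entries) can be measured.

```python
def threshold_prune(matrix, tau):
    """Zero every entry whose magnitude is below tau.

    Assumed criterion (not stated in the abstract): one comparison
    per element, with no sorting or selection step.
    """
    return [[x if abs(x) >= tau else 0.0 for x in row] for row in matrix]


def topk_prune(matrix, k):
    """Keep only the k largest-magnitude entries in each row.

    This needs a per-row selection (here a sort), which is the kind
    of extra compute a simple threshold comparison avoids.
    """
    pruned = []
    for row in matrix:
        order = sorted(range(len(row)), key=lambda i: abs(row[i]), reverse=True)
        keep = set(order[:k])
        pruned.append([x if i in keep else 0.0 for i, x in enumerate(row)])
    return pruned


def sparsity(matrix):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for x in row if x == 0.0)
    return zeros / total


# Hypothetical activation values, chosen only for illustration.
scores = [
    [0.9, -0.05, 0.3, 0.01],
    [0.02, 0.6, -0.04, 0.2],
]

by_threshold = threshold_prune(scores, tau=0.1)
by_topk = topk_prune(scores, k=2)
print(by_threshold, sparsity(by_threshold))
print(by_topk, sparsity(by_topk))
```

Every zeroed entry corresponds to a multiply-accumulate that a sparsity-aware datapath could skip, which is the sense in which pruning reduces ineffectual operations. On this toy input both strategies happen to zero the same entries; in general the threshold approach yields a data-dependent number of nonzeros per row, whereas top-k fixes it at `k`.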
