Recently, large models such as Vision Transformer and BERT have garnered significant attention due to their exceptional performance. However, their extensive computational requirements lead to considerable power and hardware resource consumption. Brain-inspired computing, characterized by its spike-driven methods, has emerged as a promising approach for low-power hardware implementation. In this paper, we propose an efficient sparse hardware accelerator for the Spike-driven Transformer. We first design a novel encoding method that encodes the position information of valid activations and skips non-spike values. This method enables us to use the encoded spikes to execute the linear, max-pooling, and spike-driven self-attention computations. Unlike the single-spike-input designs of conventional SNN accelerators, which primarily target convolution-based spiking computations, our specialized spike-driven self-attention module is unique in its ability to handle dual spike inputs. By exclusively utilizing activated spikes, our design fully exploits the sparsity of the Spike-driven Transformer, which eliminates redundant operations, lowers power consumption, and minimizes computational latency. Experimental results indicate that, compared to existing SNN accelerators, our design achieves up to 13.24$\times$ and 1.33$\times$ improvements in throughput and energy efficiency, respectively.
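The core idea of the encoding method can be illustrated in software. The sketch below is a minimal, hypothetical model (not the paper's hardware datapath): a binary spike vector is encoded as the positions of its active entries, and a linear layer is then computed by accumulating only the weight columns at those positions, skipping every non-spike (zero) input.

```python
import numpy as np

def encode_spike_positions(spikes):
    """Encode a binary spike vector as the indices of its 1s;
    all non-spike (zero) values are skipped entirely."""
    return np.flatnonzero(spikes)

def sparse_linear(positions, weight):
    """Event-driven linear layer: accumulate only the weight columns
    at active spike positions. Mathematically equal to weight @ spikes,
    but the work scales with the number of spikes, not the input size."""
    if len(positions) == 0:
        return np.zeros(weight.shape[0])
    return weight[:, positions].sum(axis=1)

# Toy example with an 8-element spike train and a 3x8 weight matrix.
spikes = np.array([1, 0, 0, 1, 0, 1, 0, 0])
W = np.arange(24, dtype=float).reshape(3, 8)

pos = encode_spike_positions(spikes)   # positions of valid activations
dense = W @ spikes                     # reference dense computation
sparse = sparse_linear(pos, W)         # sparse, spike-driven computation
assert np.allclose(dense, sparse)
```

Because spike activations are binary, the multiply in the linear layer degenerates to an accumulate, so skipping zeros removes both the memory traffic and the arithmetic for inactive inputs; this is the sparsity the accelerator exploits in hardware.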