Spiking Neural Networks (SNNs) have recently been integrated into Transformer architectures because of their potential to reduce computational demands and improve power efficiency. Yet implementing the attention mechanism with spiking signals on general-purpose computing platforms remains inefficient. In this paper, we propose a novel framework that leverages stochastic computing (SC) to effectively execute dot-product attention for SNN-based Transformers. We demonstrate that our approach achieves high classification accuracy ($83.53\%$) on CIFAR-10 within 10 time steps, comparable to the performance of a baseline artificial neural network implementation ($83.66\%$). We estimate that the proposed SC approach can reduce computing energy by more than $6.3\times$ and memory access costs by $1.7\times$ for a digital CMOS-based ASIC design. We experimentally validate our stochastic attention block design through an FPGA implementation, which achieves $48\times$ lower latency than a GPU implementation while consuming $15\times$ less power.
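To make the role of stochastic computing concrete, the following minimal sketch shows the textbook unipolar SC multiplication that a dot product can be built from: two values in $[0, 1]$ are encoded as independent Bernoulli bit-streams, a bitwise AND multiplies them in expectation, and the per-element products are summed. This is a generic illustration and not the circuit proposed in this paper; the function names `encode`, `sc_multiply`, and `sc_dot`, the stream length, and the seed are all assumptions made for the example.

```python
import numpy as np


def encode(p, length, rng):
    """Encode values p in [0, 1] as Bernoulli bit-streams (hypothetical helper).

    Returns a boolean array with one stream of `length` bits per value.
    """
    p = np.atleast_1d(np.asarray(p, dtype=float))
    return rng.random((p.size, length)) < p[:, None]


def sc_multiply(p, q, length=4096, seed=0):
    """Estimate p * q by ANDing two independent bit-streams.

    For independent streams, P(a AND b = 1) = p * q, so the fraction of
    ones in the ANDed stream is an unbiased estimate of the product.
    """
    rng = np.random.default_rng(seed)
    a = encode(p, length, rng)
    b = encode(q, length, rng)
    return float(np.mean(a & b))


def sc_dot(x, y, length=4096, seed=0):
    """Estimate sum_i x[i] * y[i] for x[i], y[i] in [0, 1].

    Each element pair is multiplied with an AND gate; the per-pair
    stream averages are then summed (a counter in hardware terms).
    """
    rng = np.random.default_rng(seed)
    xs = encode(x, length, rng)
    ys = encode(y, length, rng)
    return float((xs & ys).mean(axis=1).sum())


if __name__ == "__main__":
    print(sc_multiply(0.5, 0.4))             # close to 0.2
    print(sc_dot([0.5, 0.25], [0.4, 0.8]))   # close to 0.4
```

The estimate is noisy: its standard deviation shrinks roughly as the inverse square root of the stream length, so longer streams (or more time steps) trade latency for precision.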