Spiking Neural Networks (SNNs) provide a brain-inspired, event-driven mechanism that is believed to be key to unlocking energy-efficient deep learning. The mixture-of-experts approach mirrors the parallel distributed processing of nervous systems: it introduces conditional computation policies that expand model capacity without scaling up the number of computational operations. Additionally, spiking mixture-of-experts self-attention mechanisms enhance representation capacity, effectively capturing diverse patterns of entities and the dependencies between visual or linguistic tokens. However, hardware support is currently lacking for the highly parallel distributed processing required by spiking transformers, which embody brain-inspired computation. This paper introduces the first 3D hardware architecture and design methodology for Mixture-of-Experts and Multi-Head Attention spiking transformers. By leveraging 3D integration with memory-on-logic and logic-on-logic stacking, we explore such brain-inspired accelerators with spatially stackable circuitry, demonstrating significant improvements in energy efficiency and latency over conventional 2D CMOS integration.
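The conditional-computation idea behind mixture-of-experts can be sketched compactly: a router scores each token, and only the selected expert performs work, so capacity grows with the number of experts while per-token operations stay fixed. The following is a minimal NumPy sketch under assumed toy sizes (4 experts, 16-dimensional binary spike tokens, top-1 routing); the names `gate_w`, `expert_w`, and the Heaviside spike function are illustrative placeholders, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 experts, 16-dim binary spike tokens.
NUM_EXPERTS, DIM = 4, 16
gate_w = rng.standard_normal((DIM, NUM_EXPERTS))         # router weights (assumed)
expert_w = rng.standard_normal((NUM_EXPERTS, DIM, DIM))  # one weight matrix per expert (assumed)

def heaviside(x, threshold=1.0):
    """Spike generation: emit 1 where the potential crosses the threshold."""
    return (x >= threshold).astype(np.float32)

def moe_layer(spikes):
    """Route each binary spike token to its top-1 expert (conditional computation)."""
    scores = spikes @ gate_w            # (tokens, experts) routing logits
    chosen = scores.argmax(axis=1)      # top-1 expert index per token
    out = np.empty_like(spikes)
    for e in range(NUM_EXPERTS):
        mask = chosen == e
        if mask.any():                  # only the selected expert computes
            out[mask] = heaviside(spikes[mask] @ expert_w[e])
    return out, chosen

tokens = heaviside(rng.standard_normal((8, DIM)), threshold=0.5)  # 8 binary spike tokens
out, chosen = moe_layer(tokens)
```

Note that adding experts enlarges `expert_w` (model capacity) but each token still multiplies against exactly one expert matrix, which is the property the abstract attributes to conditional computation.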