HPCNeuroNet: Advancing Neuromorphic Audio Signal Processing with Transformer-Enhanced Spiking Neural Networks

This paper presents a novel approach to neuromorphic audio processing that integrates the strengths of Spiking Neural Networks (SNNs), Transformers, and high-performance computing (HPC) in the HPCNeuroNet architecture. Using the Intel N-DNS dataset, we demonstrate the system's ability to process diverse human vocal recordings across multiple languages and noise backgrounds. The core of our approach is the fusion of the temporal dynamics of SNNs with the attention mechanisms of Transformers, which enables the model to capture intricate audio patterns and relationships. HPCNeuroNet employs the Short-Time Fourier Transform (STFT) for time-frequency representation, Transformer embeddings for dense vector generation, and SNN encoding/decoding mechanisms for spike train conversion. The system's performance is further enhanced by leveraging the computational capabilities of an NVIDIA GeForce RTX 3060 GPU and an Intel Core i9 12900H CPU. Additionally, we introduce a hardware implementation on the Xilinx VU37P HBM FPGA platform, optimized for energy efficiency and real-time processing. The proposed accelerator achieves a throughput of 71.11 Giga-Operations Per Second (GOP/s) with an on-chip power consumption of 3.55 W at 100 MHz. Comparisons with off-the-shelf devices and recent state-of-the-art implementations show that the proposed accelerator offers clear advantages in energy efficiency and design flexibility. Through design-space exploration, we provide insights into optimizing core capacities for audio tasks. Our findings underscore the potential of integrating SNNs, Transformers, and HPC for neuromorphic audio processing and set a benchmark for future research and applications.
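To put the accelerator figures in context, the short Python sketch below derives three quantities from the numbers quoted above: throughput per watt, average operations per clock cycle, and energy per operation. These derived values are not reported in the abstract. They are simple arithmetic on the stated 71.11 GOP/s, 3.55 W, and 100 MHz, and they assume the power figure covers on-chip consumption only, as stated. The function and variable names are illustrative.

```python
# Figures quoted in the abstract for the FPGA accelerator.
THROUGHPUT_GOPS = 71.11  # throughput in giga-operations per second
POWER_W = 3.55           # on-chip power in watts (off-chip power not included)
CLOCK_MHZ = 100.0        # clock frequency in MHz


def energy_efficiency(throughput_gops: float, power_w: float) -> float:
    """Return throughput per watt in GOP/s/W."""
    if power_w <= 0:
        raise ValueError("power must be positive")
    return throughput_gops / power_w


def ops_per_cycle(throughput_gops: float, clock_mhz: float) -> float:
    """Return the average number of operations completed per clock cycle."""
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return (throughput_gops * 1e9) / (clock_mhz * 1e6)


def energy_per_op_pj(throughput_gops: float, power_w: float) -> float:
    """Return the average energy per operation in picojoules."""
    if throughput_gops <= 0:
        raise ValueError("throughput must be positive")
    joules_per_op = power_w / (throughput_gops * 1e9)
    return joules_per_op * 1e12


efficiency = energy_efficiency(THROUGHPUT_GOPS, POWER_W)
parallelism = ops_per_cycle(THROUGHPUT_GOPS, CLOCK_MHZ)
energy_pj = energy_per_op_pj(THROUGHPUT_GOPS, POWER_W)

print(f"Energy efficiency: {efficiency:.2f} GOP/s/W")
print(f"Average operations per cycle: {parallelism:.1f}")
print(f"Energy per operation: {energy_pj:.2f} pJ")
```

Under these assumptions the accelerator delivers roughly 20 GOP/s per watt, which corresponds to about 711 operations per 100 MHz clock cycle on average and about 50 pJ per operation.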
