As the volume of data recorded by embedded edge sensors increases, particularly the discrete event streams produced by neuromorphic devices, there is a growing need for hardware-aware neural architectures that enable efficient, low-latency, and energy-conscious local processing. We present an FPGA implementation of event-graph neural networks for audio processing. We utilise an artificial cochlea that converts time-series signals into sparse event data, reducing memory and computation costs. Our architecture was implemented on a SoC FPGA and evaluated on two open-source datasets. On the classification task, our baseline floating-point model achieves 92.7% accuracy on the SHD dataset, only 2.4% below the state of the art, while requiring between 10x and 67x fewer parameters. On SSC, our models achieve 66.9-71.0% accuracy. Compared with FPGA-based spiking neural networks, our quantised model reaches 92.3% accuracy, outperforming them by up to 19.3% while reducing resource usage and latency. For SSC, we report the first hardware-accelerated evaluation. We further demonstrate the first end-to-end FPGA implementation of event-audio keyword spotting (KWS), combining graph convolutional layers with recurrent sequence modelling. The system achieves up to 95% word-end detection accuracy with only 10.53 microseconds of latency and 1.18 W power consumption, establishing a strong benchmark for energy-efficient event-driven KWS.