Spiking neural networks (SNNs) offer a promising path toward energy-efficient speech command recognition (SCR) by leveraging their event-driven processing paradigm. However, existing SNN-based SCR methods often struggle to capture rich temporal dependencies and contextual information from speech due to limited temporal modeling and binary spike-based representations. To address these challenges, we first introduce the multi-view spiking temporal-aware self-attention (MSTASA) module, which combines effective spiking temporal-aware attention with a multi-view learning framework to model complementary temporal dependencies in speech commands. Building on MSTASA, we further propose SpikCommander, a fully spike-driven transformer architecture that integrates MSTASA with a spiking contextual refinement channel MLP (SCR-MLP) to jointly enhance temporal context modeling and channel-wise feature integration. We evaluate our method on three benchmark datasets: the Spiking Heidelberg Dataset (SHD), the Spiking Speech Commands (SSC), and the Google Speech Commands V2 (GSC). Extensive experiments demonstrate that SpikCommander consistently outperforms state-of-the-art (SOTA) SNN approaches with fewer parameters under comparable time steps, highlighting its effectiveness and efficiency for robust speech command recognition.
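To make the architectural idea concrete, the following is a minimal, purely illustrative sketch of a spike-driven attention block followed by a channel MLP, in the spirit of the MSTASA + SCR-MLP pipeline described above. It is not the paper's actual method: the multi-view decomposition, temporal-aware attention design, and LIF membrane dynamics are omitted, and all function names (`lif_spikes`, `spike_attention`, `channel_mlp`) and thresholds are hypothetical choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def lif_spikes(x, threshold=1.0):
    # Stateless Heaviside stand-in for a LIF neuron: emits a binary
    # spike wherever the input crosses the threshold (membrane
    # dynamics omitted for brevity -- a hypothetical simplification).
    return (x >= threshold).astype(np.float32)

def spike_attention(s, wq, wk, wv):
    # Spike-driven self-attention over a binary spike sequence s of
    # shape (T, D). Queries/keys/values are re-spiked so the
    # attention products are between binary tensors, i.e. they reduce
    # to accumulate operations rather than full multiplications.
    q = lif_spikes(s @ wq)
    k = lif_spikes(s @ wk)
    v = lif_spikes(s @ wv)
    scores = (q @ k.T) / np.sqrt(s.shape[1])       # integer accumulate, then scale
    return lif_spikes(scores @ v, threshold=0.5 * scores.shape[-1])

def channel_mlp(s, w1, w2):
    # Channel-wise refinement applied independently at each time step,
    # with spiking activations between the two projections.
    return lif_spikes(lif_spikes(s @ w1) @ w2)

T, D, H = 8, 16, 32                                # time steps, channels, hidden width
s = (rng.random((T, D)) > 0.5).astype(np.float32)  # binary spike input
wq, wk, wv = (rng.normal(size=(D, D)) for _ in range(3))
w1, w2 = rng.normal(size=(D, H)), rng.normal(size=(H, D))

out = channel_mlp(spike_attention(s, wq, wk, wv), w1, w2)
```

Note how every intermediate tensor stays binary, which is what makes the "fully spike-driven" property energy-relevant: matrix products against binary operands need only additions.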