Speech applications are expected to run at low power and remain robust in noisy conditions. An effective Voice Activity Detection (VAD) front-end reduces the overall computational load. Spiking Neural Networks (SNNs) are known to be biologically plausible and power-efficient. However, existing SNN-based VADs have yet to achieve noise robustness and often require large models to reach high performance. This paper introduces a novel SNN-based VAD model, sVAD, which features an auditory encoder with an SNN-based attention mechanism. Specifically, the encoder provides an effective auditory feature representation through SincNet and 1D convolution, and improves noise robustness with attention mechanisms. The classifier uses Spiking Recurrent Neural Networks (sRNN) to exploit temporal speech information. Experimental results demonstrate that sVAD achieves strong noise robustness while maintaining low power consumption and a small footprint, making it a promising solution for real-world VAD applications.
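As a rough illustration of two building blocks named above, the following sketch shows a SincNet-style band-pass kernel (a windowed difference of two sinc low-pass filters defined by two cutoff frequencies) and a leaky integrate-and-fire (LIF) neuron with an optional self-recurrent connection. It is not the sVAD implementation: the abstract does not specify the neuron model, so the LIF dynamics, the soft reset, and all names and parameter values (`sinc_bandpass`, `lif_run`, `decay`, `threshold`, `w_rec`, the 800 to 1200 Hz band) are assumptions made for illustration only.

```python
import math


def sinc_bandpass(f_low, f_high, kernel_size, sample_rate):
    """Band-pass FIR kernel built as the difference of two sinc low-pass
    filters, tapered with a Hamming window. In SincNet the two cutoff
    frequencies are learned; here they are fixed, hypothetical values."""
    assert kernel_size % 2 == 1, "use an odd length so the kernel is symmetric"
    half = kernel_size // 2

    def lowpass(fc, n):
        # Ideal discrete-time low-pass impulse response with cutoff fc (Hz).
        if n == 0:
            return 2.0 * fc / sample_rate
        return math.sin(2.0 * math.pi * fc * n / sample_rate) / (math.pi * n)

    kernel = []
    for n in range(-half, half + 1):
        window = 0.54 + 0.46 * math.cos(2.0 * math.pi * n / (kernel_size - 1))
        kernel.append((lowpass(f_high, n) - lowpass(f_low, n)) * window)
    return kernel


def convolve_valid(signal, kernel):
    """Plain 1D convolution that keeps only fully overlapping positions."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[k - 1 - j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def lif_run(currents, decay=0.9, threshold=1.0, w_rec=0.0):
    """Leaky integrate-and-fire neuron (assumed model, not from the paper).
    The membrane potential leaks by `decay` each step, integrates the input
    plus `w_rec` times the previous spike, and is reduced by `threshold`
    after firing (soft reset)."""
    v = 0.0
    prev_spike = 0
    spikes = []
    for current in currents:
        v = decay * v + current + w_rec * prev_spike
        prev_spike = 1 if v >= threshold else 0
        if prev_spike:
            v -= threshold
        spikes.append(prev_spike)
    return spikes


if __name__ == "__main__":
    sample_rate = 16000
    kernel = sinc_bandpass(800.0, 1200.0, 101, sample_rate)
    # A 1 kHz tone lies inside the assumed pass band.
    tone = [math.sin(2.0 * math.pi * 1000.0 * n / sample_rate) for n in range(800)]
    filtered = convolve_valid(tone, kernel)
    # Rectify the filter output and feed it to the neuron as input current.
    spikes = lif_run([abs(x) for x in filtered])
    print("spike count:", sum(spikes))
```

In this sketch the spike train acts as a sparse, event-driven encoding of in-band energy, which is the general reason spiking front-ends are considered power-efficient; a real encoder would use many such filters with trainable cutoffs.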