Keyword spotting on edge devices is becoming increasingly important as voice-activated assistants become widely used. However, its deployment is often limited by the extreme low-power constraints of the target embedded systems. Here, we explore the performance of the Temporal Difference Encoder (TDE) in keyword spotting. This recent neuron model encodes the time difference in instantaneous frequency and spike count, enabling efficient keyword spotting on neuromorphic processors. We use the TIdigits dataset of spoken digits, with a formant decomposition and rate-based encoding into spikes. We compare three Spiking Neural Network (SNN) architectures that learn and classify spatio-temporal signals. Each proposed SNN architecture has three layers and differs only in its hidden layer, which is composed of either (1) feedforward TDE, (2) feedforward Current-Based Leaky Integrate-and-Fire (CuBa-LIF), or (3) recurrent CuBa-LIF neurons. We first show that the spike trains of the frequency-converted spoken digits carry a large amount of information in the temporal domain, reinforcing the importance of better exploiting temporal encoding for such a task. We then train the three SNNs with the same number of synaptic weights to quantify and compare their performance in terms of accuracy and synaptic operations. The resulting accuracy of the feedforward TDE network (89%) is higher than that of the feedforward CuBa-LIF network (71%) and close to that of the recurrent CuBa-LIF network (91%). However, the feedforward TDE-based network performs 92% fewer synaptic operations than the recurrent CuBa-LIF network with the same number of synapses. In addition, the results of the TDE network are highly interpretable and correlate with the frequency and timescale features of the spoken keywords in the dataset. Our findings suggest that the TDE is a promising neuron model for scalable event-driven processing of spatio-temporal patterns.
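To illustrate how a TDE-style unit can map a time difference onto spike count and instantaneous frequency, the following minimal sketch simulates one facilitatory spike followed by one trigger spike. It follows a commonly described TDE formulation (a decaying gain trace that scales a trigger-evoked current feeding a current-based LIF membrane) and is not necessarily the exact model used in this work. The function name `tde_response` and all time constants, weights, and thresholds are illustrative assumptions, not values from the paper.

```python
import math


def tde_response(dt_ms, tau_fac=20.0, tau_trig=10.0, tau_mem=10.0,
                 w=1.0, v_th=1.0, step=0.1, t_max=200.0):
    """Simulate a simplified TDE-style unit (all parameters are assumptions).

    A facilitatory spike arrives at t = 0 and a trigger spike at t = dt_ms.
    Returns the list of output spike times in milliseconds.
    """
    # Per-step exponential decay factors for the three state variables.
    d_fac = math.exp(-step / tau_fac)
    d_trig = math.exp(-step / tau_trig)
    d_mem = math.exp(-step / tau_mem)

    trig_step = int(round(dt_ms / step))
    gain = 0.0    # facilitatory gain trace
    i_syn = 0.0   # trigger-evoked synaptic current
    v = 0.0       # membrane potential of the current-based LIF stage
    spikes = []

    for k in range(int(round(t_max / step))):
        # Leak all state variables.
        gain *= d_fac
        i_syn *= d_trig
        v *= d_mem

        if k == 0:
            gain += 1.0            # facilitatory spike opens the gain window
        if k == trig_step:
            i_syn += w * gain      # trigger current is scaled by the remaining gain

        v += i_syn * step          # integrate the current into the membrane
        if v >= v_th:
            spikes.append(k * step)
            v = 0.0                # reset after an output spike
    return spikes


if __name__ == "__main__":
    for dt in (0.0, 10.0, 40.0):
        out = tde_response(dt)
        print(f"time difference {dt:5.1f} ms -> {len(out)} output spikes")
```

Under these assumed parameters, a shorter time difference leaves more gain when the trigger arrives, so the unit emits more spikes at shorter inter-spike intervals, while a sufficiently long time difference produces no output and therefore no downstream synaptic operations.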