Most models based on spiking neural networks (SNNs) handle speech classification at a single temporal resolution, which prevents them from learning information in the input data at different temporal scales. Moreover, because the time lengths of the data before and after the sub-modules of many models differ, effective residual connections cannot be applied to optimize the training of these models. To address these problems, on the one hand, we reconstruct the temporal dimension of the audio spectrum and propose a novel method named Temporal Reconstruction (TR), inspired by the hierarchical process by which the human brain understands speech. With TR, the reconstructed SNN model can learn the information of the input data at different temporal scales and model more comprehensive semantic information from audio data, because the network observes the input at multiple temporal resolutions. On the other hand, by analyzing the audio data we propose the Non-Aligned Residual (NAR) method, which allows a residual connection to be applied between two audio representations with different time lengths. We have conducted extensive experiments on the Spiking Speech Commands (SSC), Spiking Heidelberg Digits (SHD), and Google Speech Commands v0.02 (GSC) datasets. According to the experimental results, we achieve a state-of-the-art (SOTA) test classification accuracy of 81.02\% on SSC among all SNN models, and a SOTA classification accuracy of 96.04\% on SHD among all models.
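To make the two ideas concrete, the following is a minimal illustrative sketch, not the paper's actual formulation: it assumes a spectrogram shaped `(T, C)` (time steps by channels), builds coarser temporal views via linear interpolation (a TR-like multi-resolution view), and adds a residual between sequences of different lengths by resampling the shorter axis first (a NAR-like connection). The function names and the choice of interpolation are hypothetical.

```python
import numpy as np

def temporal_rescale(x, t_out):
    """Resample a (T, C) sequence to t_out time steps by linear
    interpolation along the time axis (illustrative choice only)."""
    t_in = x.shape[0]
    src = np.linspace(0, t_in - 1, t_out)
    return np.stack(
        [np.interp(src, np.arange(t_in), x[:, c]) for c in range(x.shape[1])],
        axis=1,
    )

def multi_scale_views(spec, factors=(1, 2, 4)):
    """TR-like idea: coarser temporal views of a spectrogram, so a
    network can learn from several temporal resolutions at once."""
    t = spec.shape[0]
    return [temporal_rescale(spec, max(1, t // f)) for f in factors]

def non_aligned_residual(x_in, x_out):
    """NAR-like idea: a residual connection between two sequences with
    different time lengths, resampling the input to the output's
    length before adding."""
    return x_out + temporal_rescale(x_in, x_out.shape[0])
```

Any time-axis resampling (striding, pooling, interpolation) would serve the same structural role; the sketch only shows why mismatched time lengths need an explicit alignment step before a residual sum is well-defined.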