Speech emotion recognition systems suffer from high prediction latency, owing to the computational demands of deep learning models, and from low generalizability, owing mainly to the poor reliability of emotional measurements across corpora. To address these problems, we present a speech emotion recognition system based on a reductionist approach of decomposing and analyzing syllable-level features. The Mel-spectrogram of an audio stream is decomposed into syllable-level components, which are then analyzed to extract statistical features. The proposed method uses formant attention, noise-gate filtering, and rolling normalization contexts to increase feature processing speed and tolerance to adversity. A set of syllable-level formant features is extracted and fed into a single-hidden-layer neural network that makes a prediction for each syllable, in contrast to the conventional approach of using a sophisticated deep learner to make sentence-wide predictions. The syllable-level predictions help achieve real-time latency and lower the aggregated error in utterance-level cross-corpus predictions. Experiments on the IEMOCAP (IE), MSP-Improv (MI), and RAVDESS (RA) databases show that the method achieves real-time latency while predicting with state-of-the-art cross-corpus unweighted accuracy of 47.6% for IE to MI and 56.2% for MI to IE.
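The pipeline described above (rolling normalization, a per-syllable prediction from a single-hidden-layer network, and aggregation to an utterance-level label) can be illustrated with a minimal sketch. The abstract does not specify the details, so everything below is an assumption made for illustration: the trailing-window z-score used for rolling normalization, the window size, the ReLU and softmax activations, the mean-of-posteriors aggregation rule, and all function and parameter names are hypothetical and are not taken from the authors' implementation.

```python
import math
from statistics import fmean, pstdev


def rolling_normalize(features, window=3, eps=1e-8):
    """Z-score each syllable's features against a trailing window of syllables.

    features: list of per-syllable feature vectors (lists of floats).
    The trailing-window z-score is an assumed form of "rolling normalization".
    """
    out = []
    for i, row in enumerate(features):
        context = features[max(0, i - window + 1): i + 1]
        columns = list(zip(*context))  # one tuple of values per feature
        out.append([(x - fmean(c)) / (pstdev(c) + eps)
                    for x, c in zip(row, columns)])
    return out


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def syllable_posterior(x, w1, b1, w2, b2):
    """Single-hidden-layer network applied to one syllable's features.

    w1: hidden-layer weights, one row per hidden unit.
    w2: output-layer weights, one row per emotion class.
    ReLU and softmax are assumed activations.
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w2, b2)]
    return softmax(logits)


def utterance_prediction(posteriors):
    """Aggregate syllable posteriors by averaging (assumed rule)."""
    n = len(posteriors)
    mean_post = [sum(col) / n for col in zip(*posteriors)]
    label = max(range(len(mean_post)), key=mean_post.__getitem__)
    return label, mean_post
```

Because each syllable is scored independently, a prediction is available as soon as a syllable ends, and averaging over syllables is one simple way a per-syllable error could be smoothed at the utterance level.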