Speech emotion recognition (SER) is crucial for advancing affective computing and enriching human-computer interaction. However, the main challenge in SER lies in selecting relevant feature representations from speech signals at low computational cost. In this paper, we propose a lightweight SER architecture that integrates attention-based local feature blocks (ALFBs) to capture high-level relevant feature vectors from speech signals. We also incorporate a global feature block (GFB) to capture sequential global information and long-term dependencies in speech signals. By aggregating attention-based local and global contextual feature vectors, our model effectively captures the internal correlation between salient features that reflect complex human emotional cues. To evaluate our approach, we extracted four types of spectral features from speech audio samples: mel-frequency cepstral coefficients (MFCCs), mel-spectrograms, root mean square (RMS) energy, and zero-crossing rate (ZCR). Using a 5-fold cross-validation strategy, we tested the proposed method on five multilingual standard benchmark datasets: TESS, RAVDESS, BanglaSER, SUBESCO, and Emo-DB, obtaining mean accuracies of 99.65%, 94.88%, 98.12%, 97.94%, and 97.19%, respectively. The results indicate that our model achieves state-of-the-art (SOTA) performance compared to most existing methods.
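As a minimal sketch of two of the frame-level features named above, the snippet below computes RMS energy and zero-crossing rate over overlapping frames of a synthetic tone standing in for a speech sample (MFCCs and mel-spectrograms are typically obtained with a DSP library such as librosa; the frame and hop sizes here are common defaults, not values specified by the paper):

```python
import numpy as np

def frame_signal(y, frame_length=2048, hop_length=512):
    # Slice the 1-D signal into overlapping frames (n_frames x frame_length).
    n_frames = 1 + (len(y) - frame_length) // hop_length
    idx = np.arange(frame_length)[None, :] + hop_length * np.arange(n_frames)[:, None]
    return y[idx]

def rms(y, frame_length=2048, hop_length=512):
    # Root mean square energy per frame.
    frames = frame_signal(y, frame_length, hop_length)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zcr(y, frame_length=2048, hop_length=512):
    # Fraction of consecutive samples within each frame whose sign changes.
    frames = frame_signal(y, frame_length, hop_length)
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

# Synthetic 1-second 440 Hz tone at 22050 Hz, standing in for a speech clip.
sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440 * t)

rms_feat = rms(y)   # roughly 0.5 / sqrt(2) per frame for a pure sine
zcr_feat = zcr(y)   # low for a 440 Hz tone; rises with noisier signals
```

In a full SER pipeline, these per-frame vectors would be stacked with the MFCC and mel-spectrogram features before being fed to the local and global feature blocks.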