Speech emotion recognition is an important component of modern human-computer interaction systems. However, many state-of-the-art approaches rely on large pretrained models whose computational and memory requirements limit their applicability on resource-constrained hardware. This paper proposes ResLSTM-SA, a lightweight architecture that integrates residual connections with soft attention within an LSTM-based framework. Evaluated on the RAVDESS dataset under strict speaker-independent partitioning, the proposed model outperforms conventional attention-based LSTM baselines and several previously reported CNN- and hybrid CNN-LSTM architectures in terms of unweighted average recall (UAR). The best-performing variant (ResLSTM-SA-h64) achieves a maximum UAR of 0.6517 with only 46.8k trainable parameters. It delivers competitive accuracy with three orders of magnitude fewer parameters than large-scale self-supervised alternatives, which makes it suitable for efficient deployment on edge devices and in real-time voice assistants. The source code is available at https://github.com/Mak-Sim/ResLSTM-SER.
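Since the results above are reported as UAR, a minimal sketch of how that metric is conventionally computed may help: UAR is the mean of the per-class recalls, so every emotion class counts equally regardless of how many samples it has. The function name `unweighted_average_recall` and the toy labels are assumptions made for illustration; they are not taken from the paper or its repository.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls, weighting every class equally."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    # Recall is computed only over classes that occur in y_true.
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, deliberately imbalanced labels (illustrative only, not RAVDESS data).
y_true = ["angry"] * 6 + ["sad"] * 2
y_pred = ["angry"] * 6 + ["sad", "angry"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the majority class is recognized perfectly (recall 1.0) while the minority class is recognized half the time (recall 0.5), so UAR is 0.75 even though plain accuracy is 0.875. This is why UAR is often preferred over accuracy when emotion classes are unevenly represented.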