Speech Emotion Recognition (SER) traditionally relies on auditory data analysis for emotion classification, and prior studies have explored a range of methods for the task. However, existing SER methods often struggle to capture subtle emotional variations and to generalize across diverse datasets. In this article, we use Mel-Frequency Cepstral Coefficients (MFCCs) as spectral features to bridge the gap between computational emotion processing and human auditory perception. To further improve robustness and feature diversity, we propose a novel SER framework that integrates data augmentation techniques. MFCC features extracted from the augmented data are processed by a one-dimensional Convolutional Neural Network (1D-CNN) enhanced with channel and spatial attention mechanisms. These attention modules allow the model to highlight key emotional patterns, improving its ability to capture subtle variations in speech signals. The proposed method delivers state-of-the-art performance, achieving accuracies of 97.49% on SAVEE, 99.23% on RAVDESS, 89.31% on CREMA-D, 99.82% on TESS, 99.53% on EMO-DB, and 96.39% on EMOVO. These results set new benchmarks in SER and demonstrate the effectiveness of our approach in recognizing emotional expressions with high precision. Our evaluation shows that integrating advanced Deep Learning (DL) methods substantially enhances generalization across diverse datasets, underscoring their potential to advance SER toward real-world deployment in assistive technologies and human-computer interaction.
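The channel and spatial attention described above can be illustrated with a minimal numpy sketch. This is a hypothetical, simplified stand-in, not the paper's actual module: the shared-MLP channel gate and the time-step spatial gate follow the common CBAM-style design, and the spatial gate here sums the pooled maps in place of a learned 1D convolution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Gate each channel of a (channels, time) 1D feature map.

    w1, w2 form a shared two-layer MLP applied to both global
    average- and max-pooled channel descriptors (CBAM-style).
    """
    avg = feat.mean(axis=1)                         # (C,) average pool over time
    mx = feat.max(axis=1)                           # (C,) max pool over time
    score = sigmoid(avg @ w1 @ w2 + mx @ w1 @ w2)   # (C,) per-channel gate
    return feat * score[:, None]

def spatial_attention(feat):
    """Gate each time step by pooling across channels.

    The sum of the pooled maps stands in for the learned 1D conv
    a real spatial-attention module would apply (assumption).
    """
    avg = feat.mean(axis=0)                         # (T,) average pool over channels
    mx = feat.max(axis=0)                           # (T,) max pool over channels
    score = sigmoid(avg + mx)                       # (T,) per-time-step gate
    return feat * score[None, :]

# Toy MFCC-like feature map: 8 channels x 16 frames, reduction ratio 2.
rng = np.random.default_rng(0)
C, T, r = 8, 16, 2
feat = rng.normal(size=(C, T))
w1 = rng.normal(size=(C, C // r))
w2 = rng.normal(size=(C // r, C))

# Channel attention first, then spatial attention, as in CBAM-style stacking.
out = spatial_attention(channel_attention(feat, w1, w2))
print(out.shape)  # (8, 16): attention reweights features without changing shape
```

In a full model these gates would sit between 1D convolutional blocks, with `w1`, `w2`, and the spatial convolution learned jointly with the rest of the network.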