Speech Emotion Recognition (SER) aims to identify a speaker's emotional state from audio signals. While recent advances in deep learning have significantly improved SER performance for Indo-European languages, Arabic SER remains underexplored and challenging because of dialectal diversity, limited annotated datasets, and the difficulty of modeling both local spectral cues and long-range temporal dependencies. To address these limitations, this study investigates whether hybrid architectures that jointly model spatial and contextual information can improve emotion recognition in Arabic speech. We propose and evaluate a comparative framework involving three architectures: a CNN-LSTM model, a CNN-Transformer model, and a fine-tuned wav2vec 2.0 model. The first two models take MFCC and spectrogram-based representations as input, while wav2vec 2.0 operates directly on raw audio, using representations learned through self-supervised pretraining. Experiments conducted on the EYASE and BAVED datasets show that the proposed CNN-Transformer architecture outperforms the other two models, achieving an accuracy of 98.1%. This result highlights the effectiveness of combining convolutional feature extraction with Transformer-based global context modeling. The main contributions of this work are a systematic comparison of hybrid and self-supervised approaches for Arabic SER, and evidence that CNN-Transformer architectures offer a robust way to capture both local spectral cues and long-range temporal dependencies in low-resource and dialectally diverse settings.
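As a rough illustration of how a convolutional front end can hand its output to a Transformer encoder, the sketch below traces feature-map shapes through a hypothetical stack of convolution and pooling stages and flattens the result into one token per time step. The abstract does not specify layer counts, kernel sizes, or input dimensions, so `ConvStage`, `trace_shapes`, `to_token_sequence`, and every numeric value here are assumptions chosen only for illustration, not the configuration used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvStage:
    """One hypothetical conv + max-pool stage (values are assumptions)."""
    out_channels: int
    kernel: int = 3
    stride: int = 1
    padding: int = 1
    pool: int = 2  # max-pool window, also used as the pooling stride


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-length formula for a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(n_mels: int, n_frames: int, stages):
    """Return (channels, freq_bins, time_steps) after all stages."""
    channels, freq, time = 1, n_mels, n_frames
    for stage in stages:
        # Convolution followed by floor-mode pooling on both axes.
        freq = conv_out(freq, stage.kernel, stage.stride, stage.padding) // stage.pool
        time = conv_out(time, stage.kernel, stage.stride, stage.padding) // stage.pool
        channels = stage.out_channels
    return channels, freq, time


def to_token_sequence(channels: int, freq: int, time: int):
    """Merge channels and frequency bins so each time step becomes one token."""
    return time, channels * freq  # (sequence_length, token_dim)


# Assumed input: a spectrogram with 64 mel bands and 300 frames.
STAGES = [ConvStage(32), ConvStage(64), ConvStage(128)]
feature_map = trace_shapes(64, 300, STAGES)
seq_len, token_dim = to_token_sequence(*feature_map)
print("feature map (C, F, T):", feature_map)
print("tokens for the encoder:", seq_len, "x", token_dim)
```

Under these assumed settings, the convolutional stages summarize local time-frequency patterns into a short sequence of tokens, and self-attention in the encoder can then relate every time step to every other one, which is the sense in which the hybrid covers both local spectral cues and long-range temporal context.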