Speech emotion recognition is an important problem, and numerous methods have been developed in recent years to build effective and efficient systems for it. One such method fine-tunes pretrained transformers for the task and achieves high accuracy. Despite extensive worldwide effort to improve these systems, this approach has received little attention for Persian speech emotion recognition. In this article, we review the field of speech emotion recognition and its background, with an emphasis on the role of transformers. We present two models, one based on spectrograms and the other on the audio itself, both fine-tuned on the shEMO dataset. These models raise accuracy on that dataset from approximately 65% for previous systems to 80%. To investigate the effect of multilinguality on fine-tuning, we then fine-tune the same models in two stages: first on the English IEMOCAP dataset, and then on the Persian shEMO dataset. This two-stage procedure improves the accuracy of the Persian emotion recognition system to 82%.

Keywords: Persian Speech Emotion Recognition, shEMO, Self-Supervised Learning
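The two-stage procedure described in the abstract (English IEMOCAP first, Persian shEMO second) can be pictured as a schedule in which each stage starts from the checkpoint left by the previous one. The following is a minimal sketch of that control flow only, not the authors' implementation. The names `Stage`, `Checkpoint`, `run_schedule`, and `fake_fine_tune` are hypothetical, the label counts are illustrative assumptions, and attaching a fresh classification head at each stage is an assumption that the abstract does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Stage:
    """One fine-tuning pass; every field is an illustrative placeholder."""
    dataset: str     # dataset name, e.g. "IEMOCAP" or "shEMO"
    language: str
    num_labels: int  # size of the emotion label set (assumed, not from the abstract)


@dataclass
class Checkpoint:
    """Stand-in for model weights: a shared backbone plus a task-specific head."""
    backbone: str                      # records which data the backbone has seen
    head_labels: Optional[int] = None  # None until a classification head is attached


def run_schedule(
    pretrained: Checkpoint,
    stages: List[Stage],
    fine_tune: Callable[[Checkpoint, Stage], Checkpoint],
) -> Checkpoint:
    """Fine-tune stage by stage, each stage starting from the previous result."""
    ckpt = pretrained
    for stage in stages:
        # Assumption: attach a fresh head sized for this stage's label set
        # while keeping the backbone adapted so far.
        ckpt = Checkpoint(backbone=ckpt.backbone, head_labels=stage.num_labels)
        ckpt = fine_tune(ckpt, stage)
    return ckpt


def fake_fine_tune(ckpt: Checkpoint, stage: Stage) -> Checkpoint:
    """Placeholder for a real training loop: only records the adaptation order."""
    return Checkpoint(f"{ckpt.backbone} -> {stage.dataset}", ckpt.head_labels)


schedule = [
    Stage("IEMOCAP", "English", num_labels=4),  # label counts are assumptions
    Stage("shEMO", "Persian", num_labels=6),
]
final = run_schedule(Checkpoint("pretrained"), schedule, fake_fine_tune)
print(final.backbone)  # pretrained -> IEMOCAP -> shEMO
```

Passing only the last stage, `schedule[1:]`, corresponds to the single-stage setting in which the models are fine-tuned on shEMO alone. In practice, `fake_fine_tune` would be replaced by an actual training loop over the chosen pretrained transformer.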