Speech emotion recognition has evolved from research to practical applications. Previous studies of emotion recognition from speech have focused on developing models on specific datasets such as IEMOCAP. The scarcity of data in emotion modeling makes it difficult to evaluate models on other datasets and to evaluate speech emotion recognition models in a multilingual setting. This paper proposes an ensemble learning method that fuses the results of pre-trained models for emotion share recognition from speech. The models were chosen to accommodate multilingual data in English and Spanish. The results show that ensemble learning outperforms both the single-model baseline and the previous best late-fusion model. Performance is measured with the Spearman rank correlation coefficient, since the task is a regression problem with ranking values. The proposed method achieves a Spearman rank correlation coefficient of 0.537 on the test set and 0.524 on the development set. These scores are higher than those of a previous study that used a fusion method on monolingual data, which achieved 0.476 on the test set and 0.470 on the development set.
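To make the evaluation setup concrete, the following is a minimal sketch of fusing per-model predictions and scoring them with the Spearman rank correlation coefficient. It is not the paper's implementation: the abstract does not state the fusion rule, so unweighted averaging is assumed here, scores are shown for a single emotion only, and the names `average_ranks`, `spearman`, and `fuse_mean` as well as the toy numbers are hypothetical.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def fuse_mean(predictions):
    """Assumed fusion rule: unweighted mean of the models' scores per utterance."""
    return [mean(scores) for scores in zip(*predictions)]


if __name__ == "__main__":
    # Toy data (hypothetical): emotion share of one emotion for five utterances.
    labels = [0.10, 0.40, 0.35, 0.80, 0.60]
    model_a = [0.20, 0.30, 0.50, 0.70, 0.40]
    model_b = [0.05, 0.45, 0.30, 0.60, 0.75]

    fused = fuse_mean([model_a, model_b])
    for name, pred in [("model_a", model_a), ("model_b", model_b), ("fused", fused)]:
        print(name, round(spearman(pred, labels), 3))
```

Because the metric depends only on rank order, the fused scores do not need to be calibrated to the label range; only their ordering across utterances matters.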