Ensembling Multilingual Pre-Trained Models for Predicting Multi-Label Regression Emotion Share from Speech

Speech emotion recognition has evolved from research to practical applications. Previous studies of emotion recognition from speech have focused on developing models on specific datasets such as IEMOCAP. The scarcity of emotion data makes it difficult to evaluate models on other datasets and to evaluate speech emotion recognition models in a multilingual setting. This paper proposes an ensemble learning method that fuses the results of pre-trained models for emotion share recognition from speech. The models were chosen to accommodate multilingual data in English and Spanish. The results show that ensemble learning improves on both the single-model baseline and the previous best model, which used late fusion. Performance is measured with the Spearman rank correlation coefficient because the task is a regression problem with ranked values. The method achieves a Spearman rank correlation coefficient of 0.537 on the test set and 0.524 on the development set. These scores are higher than those of a previous fusion method trained on monolingual data, which scored 0.476 on the test set and 0.470 on the development set.
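The evaluation pipeline described in the abstract, fusing the outputs of several models and scoring the result with the Spearman rank correlation coefficient, can be illustrated with a minimal sketch. The abstract does not state the fusion rule or how scores are aggregated across emotion labels, so the unweighted averaging in `ensemble_mean` and the per-label averaging in `mean_spearman` are assumptions made here for illustration, as are all function names and the toy numbers.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def ensemble_mean(predictions):
    """Assumed fusion rule: unweighted mean over models.

    predictions[m][n][e] is model m's score for sample n and emotion e.
    """
    n_models = len(predictions)
    n_samples = len(predictions[0])
    n_labels = len(predictions[0][0])
    return [
        [sum(p[n][e] for p in predictions) / n_models for e in range(n_labels)]
        for n in range(n_samples)
    ]


def mean_spearman(pred, gold):
    """Assumed aggregation: mean of the per-emotion Spearman coefficients."""
    n_labels = len(gold[0])
    return mean(
        spearman([row[e] for row in pred], [row[e] for row in gold])
        for e in range(n_labels)
    )


# Toy data (made up): 2 hypothetical models, 4 samples, 2 emotion labels.
model_a = [[0.1, 0.9], [0.4, 0.6], [0.7, 0.2], [0.8, 0.1]]
model_b = [[0.2, 0.7], [0.3, 0.8], [0.6, 0.4], [0.9, 0.3]]
gold = [[0.0, 1.0], [0.3, 0.7], [0.6, 0.3], [1.0, 0.0]]

fused = ensemble_mean([model_a, model_b])
score = mean_spearman(fused, gold)
```

Because Spearman correlation depends only on rank order, any fusion rule that preserves the ordering of samples within each emotion label yields the same score, regardless of the absolute scale of the fused values.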

