Speech emotion recognition from voice messages recorded in the wild

From arXiv. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Emotion datasets used for Speech Emotion Recognition (SER) often contain acted or elicited speech, limiting their applicability in real-world scenarios. In this work, we used the Emotional Voice Messages (EMOVOME) database, which includes spontaneous voice messages from conversations of 100 Spanish speakers on a messaging app, labeled with continuous and discrete emotions by expert and non-expert annotators. We created speaker-independent SER models using eGeMAPS features, transformer-based models, and their combination. We compared the results with reference databases and analyzed the influence of annotators and gender fairness. The pre-trained Unispeech-L model and its combination with eGeMAPS achieved the highest results, with 61.64% and 55.57% Unweighted Accuracy (UA) for 3-class valence and arousal prediction respectively, a 10% improvement over baseline models. For the emotion categories, 42.58% UA was obtained. Results on EMOVOME were lower than those on the acted RAVDESS database. The elicited IEMOCAP database also outperformed EMOVOME in the prediction of emotion categories, while similar results were obtained for valence and arousal. Additionally, EMOVOME outcomes varied with annotator labels, showing superior results and better fairness when expert and non-expert annotations were combined. This study significantly contributes to the evaluation of SER models in real-life situations, advancing the development of applications for analyzing spontaneous voice messages.
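
The abstract reports results as Unweighted Accuracy (UA) but does not define it. A minimal sketch follows, assuming UA is used in the sense common in SER work: the mean of per-class recalls, so that every class counts equally regardless of how many samples it has. The function names, the label strings, and the toy data are illustrative assumptions and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls (assumed definition of UA)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    # Recall of each class that appears in y_true, then a plain average.
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def plain_accuracy(y_true, y_pred):
    """Fraction of correct predictions, dominated by the majority class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical 3-class valence labels with an imbalanced class distribution.
y_true = ["neg"] * 6 + ["neu"] * 2 + ["pos"] * 2
y_pred = ["neg"] * 6 + ["neg", "neu"] + ["neg", "neg"]

ua = unweighted_accuracy(y_true, y_pred)
acc = plain_accuracy(y_true, y_pred)
print(f"UA = {ua:.2f}, plain accuracy = {acc:.2f}")
```

In this toy case a classifier that mostly predicts the majority class still gets a high plain accuracy, while UA penalizes the missed minority classes. Under this assumed definition, chance level for a balanced 3-class problem is about 33%.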

