Modeling speech emotion with label variance and analyzing performance across speakers and unseen acoustic conditions

Spontaneous speech emotion data usually contain perceptual grades, where graders assign emotion scores after listening to the speech files. Such perceptual grades introduce label uncertainty because grader opinions vary. Grader variation is typically addressed by using consensus grades as ground truth, where the emotion with the highest vote is selected. Consensus grades fail to account for ambiguous instances in which a speech sample may contain multiple emotions, as reflected in the uncertainty of grader opinions. We demonstrate that using the probability density function of the emotion grades as targets, instead of the commonly used consensus grades, provides better performance on benchmark evaluation sets than results reported in the literature. We show that saliency-driven foundation model (FM) representation selection helps train a state-of-the-art speech emotion model for both dimensional and categorical emotion recognition. Comparing representations obtained from different FMs, we observed that focusing on overall test-set performance can be deceiving, as it fails to reveal the model's generalization capacity across speakers and gender. We demonstrate that performance evaluation across multiple test sets and performance analysis across gender and speakers are useful in assessing the usefulness of emotion models. Finally, we demonstrate that label uncertainty and data skew pose a challenge to model evaluation, where instead of using only the best hypothesis, it is useful to consider the 2- or 3-best hypotheses.
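
To make the contrast between consensus grades and distribution targets concrete, the following is a minimal sketch for the categorical case. It is an illustration under stated assumptions, not the authors' implementation: the four-emotion label set, the function names, and the example votes and logits are all hypothetical, and the grader-vote histogram is used as a simple stand-in for the distribution of emotion grades described above. The last function shows how scoring against the 2-best hypotheses can differ from scoring against only the best one.

```python
import numpy as np

# Hypothetical label set; the abstract does not specify one.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

def soft_label(votes, classes=EMOTIONS):
    """Normalize one sample's grader votes into a probability distribution."""
    counts = np.array([votes.count(c) for c in classes], dtype=float)
    return counts / counts.sum()

def consensus_label(votes, classes=EMOTIONS):
    """Index of the emotion with the highest vote (the consensus grade)."""
    return int(np.argmax(soft_label(votes, classes)))

def soft_cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits)."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-(target * log_probs).sum())

def top_k_hit(logits, label, k):
    """True if `label` is among the k highest-scoring hypotheses."""
    top_k = np.argsort(logits)[::-1][:k]
    return bool(label in top_k)

# Four graders disagree: the consensus is "happy", but half the votes differ.
votes = ["happy", "happy", "neutral", "sad"]
target = soft_label(votes)
label = consensus_label(votes)

# Hypothetical model output that ranks "neutral" slightly above "happy".
logits = np.array([0.1, 1.0, 1.2, 0.3])

print("soft target:", target)
print("soft-label loss:", soft_cross_entropy(logits, target))
print("top-1 hit:", top_k_hit(logits, label, k=1))
print("top-2 hit:", top_k_hit(logits, label, k=2))
```

In this toy case the model is counted as wrong under best-hypothesis scoring even though its second choice matches the consensus and its first choice matches one of the graders, which is the kind of ambiguity that motivates looking at the 2- or 3-best hypotheses.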

