Understanding human emotion is pivotal to making conversational technology mainstream. We treat speech emotion understanding as a perception task, which is a more realistic setting: depending on context (language, demographics, etc.), different shares of listeners perceive different emotions in the same speech segment, so the perceived emotion is not unanimous. As part of the EMotion Share track of the ACM Multimedia 2023 Computational Paralinguistics ChallengE (ComParE), we leverage its rich dataset of multilingual speakers and its multi-label regression target, the 'emotion share', that is, the share of listeners who perceive each emotion. We demonstrate that the training scheme of a foundation model dictates its effectiveness for tasks beyond speech recognition, especially non-semantic speech tasks such as emotion understanding. The task is highly complex owing to multilingual speakers, variability in the target labels, and inherent imbalance in the regression dataset. Our results show that HuBERT-Large with a lightweight self-attention-based sequence model provides a 4.6% improvement over the reported baseline.