Speech self-supervised learning (SSL) has demonstrated considerable efficacy in various downstream tasks. Nevertheless, prevailing self-supervised models often overlook emotion-related prior information, thereby missing the chance to improve emotion task comprehension through the emotional prior knowledge present in speech. In this paper, we propose an emotion-aware speech representation learning method with intensity knowledge. Specifically, we extract frame-level emotion intensities using an established speech-emotion understanding model. We then propose a novel emotional masking strategy (EMS) that incorporates these emotion intensities into the masking process. We selected two representative models based on the Transformer and CNN architectures, namely MockingJay and Non-autoregressive Predictive Coding (NPC), and conducted experiments on the IEMOCAP dataset. The experiments demonstrate that the representations derived from our method outperform those of the original models on the speech emotion recognition (SER) task.
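The abstract does not specify how EMS selects which frames to mask; a minimal sketch of one plausible reading, in which mask positions are sampled with probability proportional to frame-level emotion intensity, is shown below. All names and the sampling scheme here are hypothetical illustrations, not the paper's exact algorithm.

```python
import numpy as np

def emotion_weighted_mask(intensities, mask_ratio=0.15, seed=0):
    """Select frame indices to mask, biased toward high emotion intensity.

    intensities: 1-D array of per-frame emotion intensities (non-negative).
    Returns a boolean mask of the same length (True = masked).
    Hypothetical sketch, not the paper's exact EMS procedure.
    """
    rng = np.random.default_rng(seed)
    n = len(intensities)
    n_mask = max(1, int(round(mask_ratio * n)))
    # Turn intensities into a sampling distribution: higher intensity
    # means a frame is more likely to be masked. A small epsilon keeps
    # zero-intensity frames selectable.
    probs = np.asarray(intensities, dtype=float) + 1e-6
    probs /= probs.sum()
    idx = rng.choice(n, size=n_mask, replace=False, p=probs)
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    return mask

# Example: 10 frames, the middle frames carry the most emotion.
frames = np.array([0.1, 0.1, 0.2, 0.8, 0.9, 0.9, 0.7, 0.2, 0.1, 0.1])
mask = emotion_weighted_mask(frames, mask_ratio=0.3)
print(mask.sum())  # 3 frames masked
```

Under this reading, the SSL model's reconstruction loss would then concentrate on emotionally salient regions, which is one way intensity knowledge could shape the learned representations.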