Speaker embeddings carry valuable emotion-related information, which makes them a promising resource for enhancing speech emotion recognition (SER), especially when labeled data is limited. Traditionally, emotion information has been assumed to be only indirectly embedded in speaker embeddings, which has led to their under-utilization. Our study reveals a direct and useful link between emotion and state-of-the-art speaker embeddings in the form of intra-speaker clusters. Through a thorough clustering analysis, we demonstrate that emotion information can be readily extracted from speaker embeddings. To leverage this information, we introduce a novel contrastive pretraining approach for SER that is applied to emotion-unlabeled data. The approach samples positive and negative examples based on the intra-speaker clusters of speaker embeddings. This strategy, which leverages extensive emotion-unlabeled data, significantly improves SER performance, whether employed as a standalone pretraining task or integrated into a multi-task pretraining setting.
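The cluster-based sampling step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the clustering algorithm, the number of clusters, or where negatives are drawn from. The sketch assumes plain k-means run separately on each speaker's embeddings, positives taken from the anchor's own cluster, and negatives taken from a different cluster of the same speaker. The names `kmeans` and `sample_triplets`, the parameters `k` and `n_per_speaker`, and the toy 2-D "embeddings" are all hypothetical.

```python
import random
from collections import defaultdict


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on equal-length float vectors; returns one cluster id per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def sample_triplets(embeddings, speaker_ids, k=2, n_per_speaker=4, seed=0):
    """Return (anchor, positive, negative) index triplets plus the per-utterance cluster ids.

    Assumption (not stated in the abstract): positives share the anchor's
    intra-speaker cluster; negatives come from another cluster of the same speaker.
    """
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for idx, spk in enumerate(speaker_ids):
        by_speaker[spk].append(idx)

    triplets, cluster_of = [], {}
    for spk, idxs in by_speaker.items():
        if len(idxs) < k:
            continue  # too few utterances to cluster this speaker
        labels = kmeans([embeddings[i] for i in idxs], k, seed=seed)
        clusters = defaultdict(list)
        for i, lab in zip(idxs, labels):
            clusters[lab].append(i)
            cluster_of[i] = lab
        # An anchor cluster needs at least two members to supply a positive.
        usable = [c for c, members in clusters.items() if len(members) >= 2]
        if not usable or len(clusters) < 2:
            continue
        for _ in range(n_per_speaker):
            c = rng.choice(usable)
            anchor, positive = rng.sample(clusters[c], 2)
            other = rng.choice([o for o in clusters if o != c])
            negative = rng.choice(clusters[other])
            triplets.append((anchor, positive, negative))
    return triplets, cluster_of


# Toy data: two speakers, each with two well-separated groups of utterances.
embeddings = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [10.0, 0.0], [10.1, 0.0], [10.0, 0.1], [15.0, 5.0], [15.1, 5.0], [15.0, 5.1],
]
speaker_ids = ["A"] * 6 + ["B"] * 6

triplets, cluster_of = sample_triplets(embeddings, speaker_ids)
for anchor, positive, negative in triplets:
    print(speaker_ids[anchor], anchor, positive, negative)
```

The resulting triplets could then feed any standard contrastive objective during pretraining. No emotion labels are used at any point, which is what allows this kind of sampling to run on emotion-unlabeled data.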