The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, current research typically relies on utterance-level emotion labels, which inadequately capture the complexity of emotions within a single utterance. In this paper, we introduce GMP-TL, a novel SER framework that employs gender-augmented multi-scale pseudo-label (GMP) based transfer learning to bridge this gap. Specifically, GMP-TL first uses the pre-trained HuBERT model, applying multi-task learning and multi-scale k-means clustering to acquire frame-level GMPs. Subsequently, to fully leverage both the frame-level GMPs and the utterance-level emotion labels, a two-stage fine-tuning approach is presented to further optimize GMP-TL. Experiments on IEMOCAP show that GMP-TL attains a WAR of 80.0% and a UAR of 82.0%, outperforming state-of-the-art unimodal SER methods while yielding results comparable to multimodal SER approaches.
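As a rough illustration of the clustering step the abstract describes, the sketch below derives frame-level pseudo-labels by running k-means at several cluster counts over frame features. This is not the authors' implementation: the feature tensor is random data standing in for HuBERT hidden states, and the cluster counts are assumed values chosen only for demonstration.

```python
# Illustrative sketch (not the paper's code): multi-scale k-means over
# frame-level features, as a stand-in for clustering HuBERT hidden states
# to obtain frame-level pseudo-labels.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Placeholder features: 3 utterances x 50 frames x 768 dims
# (768 matches the HuBERT-base hidden size; the rest is arbitrary).
frames = rng.normal(size=(3, 50, 768))

# "Multi-scale" here means fitting k-means at several cluster counts;
# the scales [64, 128] are assumptions for illustration only.
scales = [64, 128]
pseudo_labels = {}
flat = frames.reshape(-1, frames.shape[-1])  # pool all frames together
for k in scales:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(flat).reshape(frames.shape[:2])
    pseudo_labels[k] = labels  # one pseudo-label per frame, per scale

print({k: v.shape for k, v in pseudo_labels.items()})
```

In the actual framework, such per-frame cluster assignments would serve as targets for fine-tuning alongside the utterance-level emotion labels.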