The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, the performance of these methods can still be improved. In this paper, we present GMP-ATL (Gender-augmented Multi-scale Pseudo-label Adaptive Transfer Learning), a novel HuBERT-based adaptive transfer learning framework for SER. Specifically, GMP-ATL first takes the pre-trained HuBERT and applies multi-task learning and multi-scale k-means clustering to obtain frame-level gender-augmented multi-scale pseudo-labels. Then, to fully leverage both the obtained frame-level pseudo-labels and the utterance-level emotion labels, we apply model retraining and fine-tuning to further optimize GMP-ATL. Experiments on IEMOCAP show that GMP-ATL achieves a weighted average recall (WAR) of 80.0\% and an unweighted average recall (UAR) of 82.0\%, surpassing state-of-the-art unimodal SER methods and yielding results comparable to multimodal SER approaches.
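As an illustration of the multi-scale k-means step, the following minimal sketch clusters frame-level features at several granularities and keeps one pseudo-label sequence per scale. It is not the authors' implementation: the random features stand in for HuBERT frame representations, the cluster counts `(4, 8, 16)` and the helper names `kmeans` and `multiscale_pseudo_labels` are hypothetical, and the gender-augmentation (multi-task) part is omitted.

```python
import numpy as np


def kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per row of x."""
    rng = np.random.default_rng(seed)
    # Initialise centres from k distinct frames (fancy indexing copies the rows).
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every frame to every centre: (n, k).
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # leave an empty cluster's centre unchanged
                centers[j] = members.mean(axis=0)
    return labels


def multiscale_pseudo_labels(frames, cluster_sizes=(4, 8, 16), seed=0):
    """Cluster the same frames at several scales (hypothetical cluster counts)."""
    return {k: kmeans(frames, k, seed=seed) for k in cluster_sizes}


# Stand-in for frame-level features of shape (num_frames, feature_dim).
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 16))

pseudo = multiscale_pseudo_labels(frames)
for k, labels in pseudo.items():
    print(k, labels[:10])
```

Each frame thus receives one label per scale, from coarse (few clusters) to fine (many clusters), which is the sense in which the pseudo-labels are multi-scale.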
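For readers unfamiliar with the two reported metrics, WAR (weighted average recall) is the overall accuracy across all utterances, while UAR (unweighted average recall) is the mean of the per-class recalls, so every emotion class counts equally regardless of its size. The sketch below computes both on a toy example; the labels and the helper name `war_uar` are illustrative assumptions and are unrelated to the IEMOCAP results.

```python
from collections import defaultdict


def war_uar(y_true, y_pred):
    """Return (WAR, UAR) for two equal-length label sequences."""
    total = defaultdict(int)    # number of utterances per true class
    correct = defaultdict(int)  # correctly recognised utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    # WAR: fraction of all utterances recognised correctly.
    war = sum(correct.values()) / len(y_true)
    # UAR: recall computed per class, then averaged with equal class weight.
    uar = sum(correct[c] / total[c] for c in total) / len(total)
    return war, uar


# Toy, imbalanced example: 6 neutral, 2 happy, 2 sad utterances.
y_true = ["neu"] * 6 + ["hap"] * 2 + ["sad"] * 2
y_pred = ["neu"] * 6 + ["hap", "neu"] + ["sad", "neu"]

war, uar = war_uar(y_true, y_pred)
print(f"WAR = {war:.3f}, UAR = {uar:.3f}")
```

In this toy case the majority class is recognised perfectly and each minority class only half the time, so WAR (8 of 10 correct) exceeds UAR (the mean of 1, 0.5, and 0.5). The two metrics diverge whenever per-class recalls differ on imbalanced data.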