Contrastive-learning-based cross-modality pretraining approaches have recently exhibited impressive success in diverse fields. In this paper, we propose GEmo-CLAP, a gender-attribute-enhanced contrastive language-audio pretraining (CLAP) method for speech emotion recognition (SER). Specifically, we first build a novel emotion CLAP model (Emo-CLAP) using pre-trained WavLM and RoBERTa models. Second, given the significance of the gender attribute in speech emotion modeling, we further propose two models, a soft-label-based GEmo-CLAP (SL-GEmo-CLAP) and a multi-task-learning-based GEmo-CLAP (ML-GEmo-CLAP), which integrate the emotion and gender information of speech signals to form more reasonable objectives. Extensive experiments on IEMOCAP show that both proposed GEmo-CLAP models consistently outperform the baseline Emo-CLAP and also achieve the best recognition performance among recent state-of-the-art methods. Notably, the proposed SL-GEmo-CLAP model achieves the best UAR of 81.43\% and WAR of 83.16\%, outperforming other state-of-the-art SER methods by at least 3\%.
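To make the soft-label idea more concrete, the following is a minimal sketch of one way a contrastive language-audio objective could blend emotion and gender agreement into its targets. The abstract does not specify the loss, so everything here is an assumption made for illustration: the function names, the mixing weight `alpha`, the `temperature`, the row normalization of the targets, and the use of a symmetric cross-entropy are all hypothetical. Real audio and text embeddings would come from encoders such as WavLM and RoBERTa; plain Python lists stand in for them.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def soft_targets(emotions, genders, alpha):
    """Hypothetical soft-label targets: alpha weights emotion agreement,
    (1 - alpha) weights gender agreement; each row is normalized to sum to 1."""
    n = len(emotions)
    targets = []
    for i in range(n):
        row = [
            alpha * float(emotions[i] == emotions[j])
            + (1.0 - alpha) * float(genders[i] == genders[j])
            for j in range(n)
        ]
        total = sum(row)  # always > 0 because the diagonal entry equals 1
        targets.append([x / total for x in row])
    return targets


def soft_label_contrastive_loss(audio_emb, text_emb, emotions, genders,
                                alpha=0.9, temperature=0.07):
    """Symmetric cross-entropy between soft targets and the scaled
    cosine-similarity matrix of audio and text embeddings (assumed form)."""
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    targets = soft_targets(emotions, genders, alpha)

    def cross_entropy(matrix):
        total = 0.0
        for row, target in zip(matrix, targets):
            log_probs = log_softmax(row)
            total -= sum(p * lp for p, lp in zip(target, log_probs))
        return total / n

    # The unnormalized target matrix is symmetric, so the same row-normalized
    # targets serve both the audio-to-text and text-to-audio directions.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits_t))
```

With `alpha=1.0` the targets in this sketch depend on emotion agreement alone, which loosely corresponds to an emotion-only baseline; lowering `alpha` gives same-gender pairs a small share of the target mass. A multi-task variant would instead keep separate emotion and gender losses and sum them with a weight, which is not shown here.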