Contrastive cross-modality pretraining has recently achieved impressive results in diverse fields, yet its merits for speech emotion recognition (SER) remain little studied. In this paper, we propose GEmo-CLAP, a gender-attribute-enhanced contrastive language-audio pretraining (CLAP) method for SER. Specifically, we first construct an effective emotion CLAP (Emo-CLAP) for SER using pre-trained text and audio encoders. Second, given the significance of gender information in SER, we further propose two novel models, a multi-task learning based GEmo-CLAP (ML-GEmo-CLAP) and a soft label based GEmo-CLAP (SL-GEmo-CLAP), which incorporate the gender information of speech signals to form more reasonable training objectives. Experiments on IEMOCAP indicate that both proposed GEmo-CLAPs consistently outperform Emo-CLAP across different pre-trained models. Remarkably, the proposed WavLM-based SL-GEmo-CLAP obtains the best unweighted average recall (UAR) of 81.43% and weighted average recall (WAR) of 83.16%, outperforming state-of-the-art SER methods by at least 3%. Our system is open-sourced on GitHub.
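The abstract does not spell out the training objective, so the following is only a minimal sketch of how a soft-label contrastive objective of this general kind might look, not the authors' implementation. It assumes that the target matrix blends emotion agreement and gender agreement between samples in a batch, and that the loss is a cross-entropy between these soft targets and the softmax of the audio-text similarity matrix. The function names, the blending weight `alpha`, and the `temperature` value are all hypothetical choices made for illustration.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def match_matrix(labels):
    """Binary matrix whose (i, j) entry is 1.0 when samples i and j share a label."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)


def soft_targets(emotions, genders, alpha=0.9):
    """Blend emotion and gender agreement (assumed form), then row-normalize.

    alpha is a hypothetical weight: alpha=1.0 uses emotion labels only.
    The diagonal of the blend is always 1.0, so no row sums to zero.
    """
    blended = alpha * match_matrix(emotions) + (1.0 - alpha) * match_matrix(genders)
    return blended / blended.sum(axis=1, keepdims=True)


def soft_contrastive_loss(audio_emb, text_emb, targets, temperature=0.07):
    """Symmetric cross-entropy between soft targets and similarity softmax."""
    # L2-normalize so the dot product is a cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature
    # Audio-to-text direction: each row is a distribution over text items.
    loss_a2t = -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()
    # Text-to-audio direction: the blended matrix is symmetric before
    # normalization, so the same row-normalized targets apply to logits.T.
    loss_t2a = -(targets * log_softmax(logits.T, axis=1)).sum(axis=1).mean()
    return 0.5 * (loss_a2t + loss_t2a)


# Toy batch: random vectors stand in for encoder outputs.
rng = np.random.default_rng(0)
emotions = ["happy", "sad", "happy", "angry"]
genders = ["f", "m", "m", "f"]
audio = rng.normal(size=(4, 8))
text = rng.normal(size=(4, 8))
targets = soft_targets(emotions, genders, alpha=0.9)
loss = soft_contrastive_loss(audio, text, targets)
print(float(loss))
```

In this sketch, setting `alpha=1.0` reduces the targets to emotion agreement alone, which is one way to picture the difference between an emotion-only objective and a gender-enhanced one. In a real system the random vectors would be replaced by outputs of the pre-trained audio and text encoders.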