Contrastive cross-modality pretraining has recently shown impressive success in diverse fields, yet its merits for speech emotion recognition (SER) remain underexplored. In this paper, we propose GEmo-CLAP, a gender-attribute-enhanced contrastive language-audio pretraining (CLAP) method for SER. Specifically, we first construct an effective emotion CLAP (Emo-CLAP) for SER using pre-trained text and audio encoders. Second, given the significance of gender information in SER, we further propose two novel models, a multi-task learning based GEmo-CLAP (ML-GEmo-CLAP) and a soft label based GEmo-CLAP (SL-GEmo-CLAP), which incorporate the gender information of speech signals to form more reasonable objectives. Experiments on IEMOCAP indicate that both proposed GEmo-CLAPs consistently outperform Emo-CLAP across different pre-trained models. Remarkably, the proposed WavLM-based SL-GEmo-CLAP obtains the best WAR of 83.16%, outperforming state-of-the-art SER methods.