Contrastive cross-modality pretraining approaches have recently achieved impressive success in diverse fields. In this paper, we propose GEmo-CLAP, a gender-attribute-enhanced contrastive language-audio pretraining (CLAP) method for speech emotion recognition (SER). Specifically, we first build an effective emotion CLAP model (Emo-CLAP) for SER using various self-supervised pre-trained models. Second, given the significance of the gender attribute in speech emotion modeling, we further propose two novel variants, a soft-label-based GEmo-CLAP (SL-GEmo-CLAP) and a multi-task-learning-based GEmo-CLAP (ML-GEmo-CLAP), which incorporate gender information from speech signals to form more reasonable objectives. Experiments on IEMOCAP demonstrate that both proposed GEmo-CLAPs consistently outperform the baseline Emo-CLAP across various pre-trained models, while also achieving the best recognition performance compared with state-of-the-art SER methods. Remarkably, the proposed WavLM-based SL-GEmo-CLAP model achieves the best unweighted average recall (UAR) of 81.43% and weighted average recall (WAR) of 83.16%.
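For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how UAR and WAR are commonly computed, assuming their usual SER definitions: UAR is the plain mean of per-class recalls, and WAR is the mean of per-class recalls weighted by class size, which equals overall accuracy. The function name `uar_war` and the toy labels are hypothetical and purely illustrative; they are not taken from the paper's evaluation code or from IEMOCAP.

```python
from collections import Counter


def uar_war(y_true, y_pred):
    """Return (UAR, WAR) for parallel sequences of true and predicted labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    support = Counter(y_true)  # number of utterances per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    recalls = [hits[c] / support[c] for c in support]  # per-class recall
    uar = sum(recalls) / len(recalls)       # every class counts equally
    war = sum(hits.values()) / len(y_true)  # recall weighted by class size
    return uar, war


# Toy labels, made up for illustration (assumption: not real data).
y_true = ["neutral"] * 6 + ["happy"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["happy", "neutral"] + ["neutral", "neutral"]
uar, war = uar_war(y_true, y_pred)
print(f"UAR={uar:.2f} WAR={war:.2f}")  # prints "UAR=0.50 WAR=0.70"
```

In this toy case the majority class is always recognized, so WAR stays high while UAR drops, which is why class-imbalanced SER benchmarks typically report both.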