Contrastive-learning-based pretraining methods have recently achieved strong results in diverse fields. In this paper, we propose GEmo-CLAP, an efficient gender-attribute-enhanced contrastive language-audio pretraining (CLAP) model for speech emotion recognition. Specifically, we first build an effective emotion CLAP model, Emo-CLAP, for emotion recognition, using various pre-trained models based on self-supervised learning. Then, considering the importance of the gender attribute in speech emotion modeling, we propose two GEmo-CLAP approaches that integrate the emotion and gender information of speech signals, forming more reasonable objectives. Extensive experiments on the IEMOCAP corpus demonstrate that both GEmo-CLAP approaches consistently outperform the baseline Emo-CLAP across different pre-trained models, while also achieving superior recognition performance compared with other state-of-the-art methods.
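To make the idea of combining emotion and gender information in a contrastive language-audio objective more concrete, the following is a minimal NumPy sketch. It is an illustration under stated assumptions, not the objective used in GEmo-CLAP: the function names, the mixing weight `alpha`, the `temperature` value, and the choice of a symmetric soft-target cross-entropy are all hypothetical. Audio and text embeddings are assumed to be already computed by some pre-trained encoders.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def target_matrix(labels):
    # Entry (i, j) is nonzero when samples i and j share a label;
    # each row is normalized to sum to 1.
    labels = np.asarray(labels)
    same = (labels[:, None] == labels[None, :]).astype(float)
    return same / same.sum(axis=1, keepdims=True)

def soft_contrastive_loss(audio_emb, text_emb, emotions, genders,
                          alpha=0.8, temperature=0.07):
    # Hypothetical objective: mix emotion and gender agreement into one
    # soft target, then apply cross-entropy in both directions
    # (audio-to-text and text-to-audio).
    a = l2_normalize(np.asarray(audio_emb, dtype=float))
    t = l2_normalize(np.asarray(text_emb, dtype=float))
    logits = a @ t.T / temperature
    target = (alpha * target_matrix(emotions)
              + (1.0 - alpha) * target_matrix(genders))
    loss_a2t = -(target * log_softmax(logits, axis=1)).sum(axis=1).mean()
    loss_t2a = -(target.T * log_softmax(logits, axis=0)).sum(axis=0).mean()
    return 0.5 * (loss_a2t + loss_t2a)

# Toy batch: four utterances, two emotion classes, two gender classes.
emotions = [0, 1, 0, 1]
genders = [0, 0, 1, 1]
audio = np.eye(4)
aligned_text = np.eye(4)                      # each text matches its audio
shuffled_text = np.roll(np.eye(4), 1, axis=0)  # texts paired with wrong audio

aligned_loss = soft_contrastive_loss(audio, aligned_text, emotions, genders)
shuffled_loss = soft_contrastive_loss(audio, shuffled_text, emotions, genders)
```

In this toy batch the aligned pairing yields a lower loss than the shuffled one. Setting `alpha=1.0` reduces the sketch to an emotion-only target, which loosely corresponds to the role the baseline Emo-CLAP plays in the comparison.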