Contrastive-learning-based pretraining methods have recently shown impressive success across diverse fields. In this paper, we propose GEmo-CLAP, an efficient gender-attribute-enhanced contrastive language-audio pretraining (CLAP) model for speech emotion recognition. Specifically, we first build an effective emotion CLAP model, Emo-CLAP, for emotion recognition using various pre-trained models based on self-supervised learning. Then, considering the importance of the gender attribute in speech emotion modeling, we further propose two GEmo-CLAP approaches that integrate the emotion and gender information of speech signals, forming more reasonable objectives. Extensive experiments on the IEMOCAP corpus demonstrate that both proposed GEmo-CLAP approaches consistently outperform the baseline Emo-CLAP across different pre-trained models, while also achieving superior recognition performance compared with other state-of-the-art methods.
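As a rough illustration of how emotion and gender labels might be combined in a contrastive language-audio objective, the sketch below builds a label-aware target matrix and scores audio-text similarities against it. It is not the formulation used in this paper: the function names (`label_target`, `contrastive_loss`), the mixing weight `alpha`, the temperature value, and the use of soft-target cross-entropy are all assumptions made for illustration.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def label_target(labels):
    # Hypothetical helper: entry (i, j) is non-zero when samples i and j
    # share a label; each row is normalized to sum to 1.
    labels = np.asarray(labels)
    same = (labels[:, None] == labels[None, :]).astype(float)
    return same / same.sum(axis=1, keepdims=True)

def contrastive_loss(audio_emb, text_emb, target, temperature=0.07):
    # Cosine-similarity logits between every audio and text embedding.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature
    # Soft-target cross-entropy in both directions (audio-to-text over rows,
    # text-to-audio over columns). target.T has columns that sum to 1
    # because the underlying same-label matrix is symmetric.
    loss_a2t = -(target * log_softmax(logits, axis=1)).sum(axis=1).mean()
    loss_t2a = -(target.T * log_softmax(logits, axis=0)).sum(axis=0).mean()
    return 0.5 * (loss_a2t + loss_t2a)

def mixed_target(emotions, genders, alpha=0.8):
    # Assumed weighting: alpha on emotion agreement, (1 - alpha) on gender
    # agreement. Rows still sum to 1 because both parts are row-normalized.
    return alpha * label_target(emotions) + (1.0 - alpha) * label_target(genders)
```

With random embeddings for a batch of four utterances, `contrastive_loss(audio, text, mixed_target(emotions, genders))` returns a single positive scalar. Setting `alpha=1.0` reduces the target to emotion labels only, which loosely corresponds to an emotion-only baseline in this sketch.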