Contrastive-learning-based pretraining methods have recently achieved impressive success across diverse fields. In this paper, we propose GEmo-CLAP, an efficient gender-attribute-enhanced contrastive language-audio pretraining (CLAP) model for speech emotion recognition. Specifically, we first build an effective emotion CLAP model, Emo-CLAP, for emotion recognition using various self-supervised pre-trained models. Then, considering the importance of the gender attribute in speech emotion modeling, we further propose two GEmo-CLAP approaches that integrate the emotion and gender information of speech signals, forming more reasonable objectives. Extensive experiments on the IEMOCAP corpus demonstrate that both proposed GEmo-CLAP approaches consistently outperform the baseline Emo-CLAP across different pre-trained models, while also achieving superior recognition performance compared with other state-of-the-art methods.
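To make the contrastive language-audio idea concrete, the following is a minimal sketch of a generic CLIP/CLAP-style symmetric contrastive loss over paired audio and text embeddings. It is an illustration under stated assumptions, not the objective used by Emo-CLAP or GEmo-CLAP: the function names, the temperature value, and the one-to-one pairing of audio and text within a batch are all hypothetical choices made here. In emotion recognition, several utterances in a batch can share the same emotion label, so a real objective would likely need to treat those as additional positives.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an audio-text similarity matrix.

    Assumption (not from the paper): audio_emb[i] and text_emb[i] form the
    only positive pair for index i; every other combination is a negative.
    """
    n = len(audio_emb)
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]

    # logits[i][j] = cosine similarity of audio i and text j, scaled by temperature
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Audio-to-text direction: each row should peak on its diagonal entry.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-audio direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    t2a = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n

    return 0.5 * (a2t + t2a)


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    aligned_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss:", clap_contrastive_loss(audio, aligned_text))
    print("swapped loss:", clap_contrastive_loss(audio, swapped_text))
```

With correctly aligned pairs the loss is close to zero, and with swapped pairs it is large, which is the behavior a contrastive objective is meant to reward and penalize respectively.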