Cross-corpus speech emotion recognition (SER) seeks to transfer the ability to infer speech emotion from a well-labeled corpus to an unlabeled one, a challenging task because of the significant discrepancy between the two corpora. Existing methods, typically based on unsupervised domain adaptation (UDA), attempt to learn corpus-invariant features through global distribution alignment, but the resulting features are often mixed with corpus-specific features or are not class-discriminative. To tackle these challenges, we propose the Emotion Decoupling aNd Alignment learning framework (EMO-DNA), a novel UDA method for cross-corpus SER that learns emotion-relevant, corpus-invariant features. The novelty of EMO-DNA is two-fold: contrastive emotion decoupling and dual-level emotion alignment. On the one hand, contrastive emotion decoupling uses a contrastive decoupling loss to strengthen the separability of emotion-relevant features from corpus-specific ones. On the other hand, dual-level emotion alignment introduces adaptive-threshold pseudo-labeling to select confident target samples for class-level alignment and performs corpus-level alignment, so that the two jointly guide the model to learn class-discriminative, corpus-invariant features across corpora. Extensive experimental results demonstrate the superior performance of EMO-DNA over state-of-the-art methods in several cross-corpus scenarios. Source code is available at https://github.com/Jiaxin-Ye/Emo-DNA.
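
As a rough illustration of selecting confident target samples with an adaptive threshold, the sketch below picks pseudo-labels from predicted class probabilities. The abstract does not state how the threshold adapts, so the per-class scaling rule here is purely an assumption, and the function name `select_confident_samples` and the parameter `base_threshold` are hypothetical; this is not the EMO-DNA implementation.

```python
from typing import List, Tuple


def select_confident_samples(
    probs: List[List[float]],
    base_threshold: float = 0.9,
) -> List[Tuple[int, int]]:
    """Return (sample_index, pseudo_label) pairs for confident target samples.

    Assumed rule (not taken from the paper): class c gets the threshold
    base_threshold * mean_conf[c] / max(mean_conf), so classes the model is
    currently less sure about face a lower bar.
    """
    if not probs:
        return []
    num_classes = len(probs[0])

    # Predicted class and its probability for every unlabeled target sample.
    preds = [max(range(num_classes), key=row.__getitem__) for row in probs]
    confs = [row[p] for row, p in zip(probs, preds)]

    # Mean confidence over the samples currently predicted as each class.
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for p, c in zip(preds, confs):
        sums[p] += c
        counts[p] += 1
    mean_conf = [s / n if n else 0.0 for s, n in zip(sums, counts)]
    top = max(mean_conf)

    # Per-class thresholds, scaled relative to the most confident class.
    thresholds = [base_threshold * m / top for m in mean_conf]

    # Keep only the samples whose confidence clears their class threshold.
    return [
        (i, p)
        for i, (p, c) in enumerate(zip(preds, confs))
        if c >= thresholds[p]
    ]
```

Under this assumed rule, the selected pairs would then feed a class-level alignment term, while the remaining target samples would contribute only to corpus-level alignment.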