Cross-corpus speech emotion recognition (SER) aims to transfer emotional knowledge from a labeled source corpus to an unlabeled target corpus. However, prior methods require access to the source data during adaptation, which is often infeasible in real-life scenarios because of data privacy concerns. This paper tackles a more practical task, namely source-free cross-corpus SER, where a pre-trained source model is adapted to the target domain without access to the source data. To address this problem, we propose a novel method called the emotion-aware contrastive adaptation network (ECAN). The core idea is to capture local neighborhood information between samples while also performing global class-level adaptation. Specifically, we propose a nearest-neighbor contrastive learning scheme that promotes local emotion consistency among the features of highly similar samples. Relying solely on nearest neighbors, however, may leave the boundaries between clusters ambiguous. We therefore incorporate supervised contrastive learning to encourage greater separation between clusters representing different emotions, thereby improving class-level adaptation. Extensive experiments on several speech emotion corpora indicate that the proposed ECAN significantly outperforms state-of-the-art methods in the source-free cross-corpus SER setting.
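The two objectives described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no loss formulas, so the sketch assumes an InfoNCE-style loss over cosine similarities, assumes that the supervised term uses pseudo-labels predicted by the source model (the target corpus is unlabeled), and uses hypothetical names throughout (`contrastive_loss`, `nearest_neighbor_positives`, `same_label_positives`, `combined_contrastive_loss`, `k`, `temperature`, `weight`).

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine_sim_matrix(feats):
    """Pairwise cosine similarity between all feature vectors."""
    z = [l2_normalize(f) for f in feats]
    return [[sum(a * b for a, b in zip(zi, zj)) for zj in z] for zi in z]


def contrastive_loss(sim, positives, temperature=0.1):
    """InfoNCE-style loss (an assumption): for each anchor i, raise the
    softmax probability of its positives relative to all other samples."""
    n = len(sim)
    total, count = 0.0, 0
    for i in range(n):
        if not positives[i]:
            continue  # anchors without positives contribute nothing
        logits = {j: sim[i][j] / temperature for j in range(n) if j != i}
        m = max(logits.values())  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives[i]) / len(positives[i])
        count += 1
    return total / count if count else 0.0


def nearest_neighbor_positives(sim, k=1):
    """Local term: the k most similar other samples act as positives."""
    n = len(sim)
    return [
        sorted((j for j in range(n) if j != i), key=lambda j: sim[i][j], reverse=True)[:k]
        for i in range(n)
    ]


def same_label_positives(pseudo_labels):
    """Class-level term: samples sharing a (pseudo-)label act as positives."""
    n = len(pseudo_labels)
    return [
        [j for j in range(n) if j != i and pseudo_labels[j] == pseudo_labels[i]]
        for i in range(n)
    ]


def combined_contrastive_loss(feats, pseudo_labels, k=1, temperature=0.1, weight=1.0):
    """Hypothetical combination: local neighbor term plus weighted class-level term."""
    sim = cosine_sim_matrix(feats)
    local = contrastive_loss(sim, nearest_neighbor_positives(sim, k), temperature)
    class_level = contrastive_loss(sim, same_label_positives(pseudo_labels), temperature)
    return local + weight * class_level


# Toy usage: two tight clusters of 2-D features with matching pseudo-labels.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss = combined_contrastive_loss(feats, pseudo_labels=[0, 0, 1, 1])
```

In this sketch the neighbor term only ever looks at each sample's closest features, while the label term pulls together every sample sharing a pseudo-label and pushes the rest apart, which mirrors the local and class-level roles the abstract assigns to the two components. How the two terms are actually weighted and how labels for the supervised term are obtained are not stated in the abstract.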