Zero-shot cross-lingual speech emotion recognition (SER) remains challenging due to distribution mismatches across languages and the lack of emotion annotations in the target language. Under such conditions, models trained solely on source-language data frequently generalize poorly when evaluated on unseen target languages. To address this limitation, we propose an emotion-discriminative representation learning method that integrates supervised contrastive learning and speaker adversarial learning. Supervised contrastive learning promotes cross-lingual emotion alignment, while speaker adversarial learning suppresses speaker-related cues to encourage speaker-invariant representations. Experimental results under a zero-shot cross-lingual SER setting demonstrate that the proposed method significantly improves SER performance over conventional training strategies.
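As an illustration of the contrastive component, the following is a minimal sketch of a supervised contrastive loss in its commonly used form, assuming L2-normalized utterance embeddings and a temperature of 0.1. The function name, the temperature value, and the toy embeddings are illustrative assumptions and are not taken from the proposed method. The speaker adversarial branch is omitted here because it would typically require an automatic differentiation framework, for example to implement gradient reversal.

```python
import math


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over a batch (illustrative sketch).

    Utterances sharing an emotion label are treated as positives,
    regardless of which language or speaker they come from.
    The loss is averaged over anchors that have at least one positive.
    """
    # L2-normalize each embedding so dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, n_anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled similarities to every other sample in the batch.
        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values())
        )
        # Mean negative log-probability of the positives for this anchor.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


# Toy batch: two tight clusters in a 2-D embedding space (assumed values).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_labels = ["happy", "happy", "sad", "sad"]  # clusters match emotions
mixed_labels = ["happy", "sad", "happy", "sad"]    # clusters split emotions

loss_aligned = supervised_contrastive_loss(toy_embeddings, aligned_labels)
loss_mixed = supervised_contrastive_loss(toy_embeddings, mixed_labels)
```

In this toy batch the loss is lower when samples with the same emotion label lie close together, which is the behavior the contrastive term is meant to encourage across languages.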