Speech Emotion Recognition (SER) performance on a single language has improved greatly in the last few years thanks to deep learning techniques. However, cross-lingual SER remains a challenge in real-world applications due to two main factors: the first is the large gap between the source and target domain distributions; the second is that unlabeled utterances are far more readily available than labeled ones for the new language. Taking these factors into account, we propose a Semi-Supervised Learning (SSL) method for cross-lingual emotion recognition when only a few labeled examples in the target domain (i.e., the new language) are available. Our method is based on a Transformer and adapts to the new domain by exploiting a pseudo-labeling strategy on the unlabeled utterances. In particular, we investigate the use of both hard and soft pseudo-labels. We thoroughly evaluate the proposed method in a speaker-independent setup on both the source and the new language and show its robustness across five languages belonging to different language families. The experimental findings indicate that unweighted accuracy increases by an average of 40% compared to state-of-the-art methods.
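To make the distinction between hard and soft pseudo-labels concrete, the following is a minimal sketch of one common way to build each kind of target from a model's class probabilities on an unlabeled utterance. It is not the procedure used in this work, whose details are not given in the abstract: the confidence `threshold`, the sharpening `temperature`, and all function names are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_pseudo_label(probs, threshold=0.9):
    """Hard pseudo-label: the argmax class index, kept only if the model
    is confident enough. The threshold value is an assumption."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] >= threshold else None


def soft_pseudo_label(probs, temperature=0.5):
    """Soft pseudo-label: the full distribution, sharpened by raising each
    probability to 1/temperature and renormalizing. Sharpening is an
    assumed design choice; temperature=1.0 leaves probs unchanged."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def one_hot(index, num_classes):
    """Turn a class index into a one-hot target distribution."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def cross_entropy(target, probs, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probs."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))


if __name__ == "__main__":
    # Hypothetical logits for one unlabeled utterance over four emotions.
    probs = softmax([4.0, 0.5, 0.1, -1.0])

    hard = hard_pseudo_label(probs)
    if hard is not None:
        # Confident prediction: train against a one-hot target.
        print("hard-label loss:", cross_entropy(one_hot(hard, len(probs)), probs))

    # Soft target: every utterance contributes, weighted by model belief.
    soft = soft_pseudo_label(probs)
    print("soft-label loss:", cross_entropy(soft, probs))
```

In this sketch a hard pseudo-label discards utterances the model is unsure about, whereas a soft pseudo-label keeps every utterance but carries the model's uncertainty into the training target.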