There are individual differences in expressive behaviors driven by cultural norms and personality. This between-person variation can result in reduced emotion recognition performance. Therefore, personalization is an important step in improving the generalization and robustness of speech emotion recognition. In this paper, to achieve unsupervised personalized emotion recognition, we first pre-train an encoder with learnable speaker embeddings in a self-supervised manner to learn robust speech representations conditioned on speakers. Second, we propose an unsupervised method to compensate for the label distribution shifts by finding similar speakers and leveraging their label distributions from the training set. Extensive experimental results on the MSP-Podcast corpus indicate that our method consistently outperforms strong personalization baselines and achieves state-of-the-art performance for valence estimation.
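The second step, compensating for label distribution shift by borrowing statistics from similar training speakers, can be illustrated with a minimal sketch. The abstract does not specify how similarity is measured or how the borrowed label distributions are applied, so everything here is an assumption for illustration: cosine similarity over speaker embeddings, an unweighted average over the `k` nearest training speakers, and a simple mean/standard-deviation rescaling of the predictions. The function name `personalize_predictions` and all parameter names are hypothetical and are not taken from the paper.

```python
import numpy as np


def personalize_predictions(test_emb, test_preds, train_embs,
                            train_label_means, train_label_stds, k=5):
    """Rescale one test speaker's predictions toward the label statistics
    of the k most similar training speakers.

    Illustrative assumptions (not from the paper): cosine similarity,
    unweighted averaging, and mean/std matching.
    """
    test_emb = np.asarray(test_emb, dtype=float)
    test_preds = np.asarray(test_preds, dtype=float)
    train_embs = np.asarray(train_embs, dtype=float)
    train_label_means = np.asarray(train_label_means, dtype=float)
    train_label_stds = np.asarray(train_label_stds, dtype=float)

    # Cosine similarity between the test speaker and every training speaker.
    test_unit = test_emb / np.linalg.norm(test_emb)
    train_unit = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = train_unit @ test_unit

    # Indices of the k most similar training speakers.
    k = min(k, len(sims))
    nearest = np.argsort(-sims)[:k]

    # Label statistics borrowed from those speakers (no test labels used).
    target_mean = train_label_means[nearest].mean()
    target_std = train_label_stds[nearest].mean()

    # Standardize the raw predictions, then map them onto the borrowed statistics.
    eps = 1e-8
    standardized = (test_preds - test_preds.mean()) / (test_preds.std() + eps)
    return standardized * target_std + target_mean
```

Only the test speaker's embedding and the model's own predictions are needed at inference time, which is what keeps an adjustment of this kind unsupervised. The per-speaker label means and standard deviations come from the training set alone.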