Speech Emotion Recognition (SER) is a challenging task because data are limited and the boundaries between certain emotions are blurred. In this paper, we present a comprehensive approach that improves SER performance across the model lifecycle, covering the pre-training, fine-tuning, and inference stages. To address data scarcity, we build on a pre-trained model, wav2vec2.0. During fine-tuning, we propose a novel loss function that combines cross-entropy loss with a supervised contrastive learning loss to improve the model's discriminative ability. This loss increases inter-class distances and decreases intra-class distances, mitigating the blurred-boundary problem. Finally, to leverage the improved distances, we propose an interpolation method at the inference stage that combines the model's prediction with the output of a k-nearest neighbors model. Our experiments on IEMOCAP demonstrate that the proposed methods outperform current state-of-the-art results.
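To make the fine-tuning objective concrete, the following is a minimal, dependency-free sketch of a loss that adds a supervised contrastive term to cross-entropy. The abstract does not specify how the two terms are weighted, so the weight `alpha`, the `temperature`, the use of L2-normalized embeddings, and all function names here are illustrative assumptions rather than the exact formulation used in the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch of logit rows and integer labels."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[y]
    return total / len(labels)


def supervised_contrastive(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have a positive.

    For each anchor, samples sharing its label are pulled closer and all
    other samples in the batch are pushed away.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def combined_loss(logits, embeddings, labels, alpha=0.5, temperature=0.1):
    """Cross-entropy plus alpha times the supervised contrastive term.

    The additive weighting with alpha is an assumption for illustration.
    """
    ce = cross_entropy(logits, labels)
    scl = supervised_contrastive(embeddings, labels, temperature)
    return ce + alpha * scl
```

In this sketch, a batch whose same-label embeddings are already clustered yields a small contrastive term, while a batch whose labels cut across the clusters is penalized heavily, which is the behavior that widens inter-class distances and shrinks intra-class ones.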
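The inference-stage interpolation can be sketched in the same spirit. The abstract states only that the model prediction is combined with the output of a k-nearest neighbors model; the specific choices below (squared Euclidean distance, a softmax over negative distances to weight the neighbors, the mixing coefficient `lam`, and all function names) are assumptions made for illustration.

```python
import math


def knn_distribution(query, datastore, num_classes, k=3, temperature=1.0):
    """Class distribution from the k nearest stored embeddings.

    `datastore` is a list of (embedding, label) pairs, for example the
    training-set embeddings produced by the fine-tuned model. Neighbors are
    weighted by a softmax over negative squared distances (an assumption).
    """
    if not datastore:
        raise ValueError("datastore must contain at least one entry")
    nearest = sorted(
        (sum((q - x) ** 2 for q, x in zip(query, emb)), label)
        for emb, label in datastore
    )[:k]
    d_min = nearest[0][0]  # shift distances for numerical stability
    weights = [math.exp(-(d - d_min) / temperature) for d, _ in nearest]
    total = sum(weights)
    probs = [0.0] * num_classes
    for w, (_, label) in zip(weights, nearest):
        probs[label] += w / total
    return probs


def interpolate(model_probs, knn_probs, lam=0.5):
    """Convex combination: lam * kNN distribution + (1 - lam) * model output."""
    return [lam * pk + (1.0 - lam) * pm for pm, pk in zip(model_probs, knn_probs)]


# Usage: a query that lies next to two class-0 entries in the datastore.
datastore = [([0.0, 0.0], 0), ([0.1, 0.0], 0), ([5.0, 5.0], 1), ([5.1, 5.0], 1)]
knn_probs = knn_distribution([0.05, 0.0], datastore, num_classes=2, k=2)
final_probs = interpolate([0.4, 0.6], knn_probs, lam=0.5)
predicted_class = max(range(len(final_probs)), key=final_probs.__getitem__)
```

Here the classifier alone slightly prefers class 1, but the neighbors in the embedding space all belong to class 0, so the interpolated distribution selects class 0. With `lam=0` the sketch reduces to the model prediction alone, and with `lam=1` to the nearest-neighbor vote alone.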