The use of self-supervised learning (SSL) speech models for downstream tasks has recently drawn considerable attention. While large pre-trained models commonly outperform smaller models trained from scratch, the question of how best to fine-tune them remains open. In this paper, we explore fine-tuning strategies for the WavLM Large model on the speech emotion recognition task, using the MSP Podcast Corpus. Specifically, we perform a series of experiments that focus on using gender and semantic information from utterances. We then summarize our findings and describe the final model we submitted to the Speech Emotion Recognition Challenge 2024.