Foundation models have shown superior performance for speech emotion recognition (SER). However, given the limited data in emotion corpora, finetuning all parameters of large pre-trained models for SER can be both resource-intensive and susceptible to overfitting. This paper investigates parameter-efficient finetuning (PEFT) for SER. Various PEFT adaptors are systematically studied for both classification of discrete emotion categories and prediction of dimensional emotional attributes. The results demonstrate that a combination of PEFT methods surpasses full finetuning while significantly reducing the number of trainable parameters. Furthermore, a two-stage adaptation strategy is proposed that adapts models trained on acted emotion data, which is more readily available, so that they better capture natural emotional expressions. Both intra- and cross-corpus experiments validate the efficacy of the proposed approach in enhancing performance on both the source and target domains.
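The reduction in trainable parameters that motivates PEFT can be illustrated with a small sketch. It assumes a low-rank (LoRA-style) adaptor on a single linear layer, chosen here only as one common PEFT method; the abstract does not state which adaptors were studied, and the class name `LowRankAdapterLinear`, the layer sizes, and the rank are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LowRankAdapterLinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A.

    Hypothetical illustration only; not the adaptor configuration of the paper.
    """

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        # Stand-in for a frozen pre-trained weight (random values here).
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable factors: A is small and random, B starts at zero so the
        # adapted layer initially behaves exactly like the frozen layer.
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        """Map a list of d_in floats to a list of d_out floats."""
        col = [[v] for v in x]
        base = matmul(self.W, col)                    # frozen path
        delta = matmul(self.B, matmul(self.A, col))   # low-rank trainable path
        return [b[0] + d[0] for b, d in zip(base, delta)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])


layer = LowRankAdapterLinear(d_in=64, d_out=64, rank=4)
print(layer.trainable_params(), layer.frozen_params())  # prints: 512 4096
```

In this toy setting only the 512 adaptor parameters would be updated during finetuning, while the 4096 weights of the layer stay fixed. The actual saving depends on the model, the adaptor type, and the rank.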