Many recent studies have focused on fine-tuning pre-trained models for speech emotion recognition (SER), resulting in promising performance compared to traditional methods that rely largely on low-level, knowledge-inspired acoustic features. These pre-trained speech models learn general-purpose speech representations from large-scale datasets using self-supervised or weakly-supervised learning objectives. Despite the significant advances made in SER through the use of pre-trained architectures, fine-tuning these large models for different datasets requires saving a full copy of the weight parameters for each dataset, which makes them impractical to deploy in real-world settings. As an alternative, this work explores parameter-efficient fine-tuning (PEFT) approaches for adapting pre-trained speech models for emotion recognition. Specifically, we evaluate the efficacy of adapter tuning, embedding prompt tuning, and LoRA (low-rank adaptation) on four popular SER testbeds. Our results reveal that LoRA achieves the best fine-tuning performance in emotion recognition while enhancing fairness and requiring only a small number of additional weight parameters. Furthermore, our findings offer novel insights into future research directions in SER, distinct from existing approaches that focus on directly fine-tuning the model architecture. Our code is publicly available at https://github.com/usc-sail/peft-ser.
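To make the parameter-count argument concrete, the following is a minimal sketch of a low-rank update applied to a single frozen linear layer. It is not taken from the authors' repository: the class name `LoRALinear`, the hidden size of 768, the rank of 8, and the scaling factor `alpha` are all illustrative assumptions, and plain NumPy stands in for a deep learning framework. The pre-trained weight stays frozen and only the two small matrices `A` and `B` would be trained and stored per dataset.

```python
import numpy as np


class LoRALinear:
    """Hypothetical sketch: frozen dense layer plus a trainable low-rank update.

    Computes y = x W^T + (alpha / rank) * x A^T B^T, where W is frozen and
    only A and B would be updated during fine-tuning.
    """

    def __init__(self, weight, rank=8, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight  # frozen pre-trained weight, shape (out_dim, in_dim)
        # A gets a small random init; B starts at zero, so the low-rank
        # update contributes nothing before any training step.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T              # frozen path
        update = (x @ self.A.T) @ self.B.T    # low-rank path
        return base + self.scale * update

    def trainable_params(self):
        return self.A.size + self.B.size

    def frozen_params(self):
        return self.weight.size


if __name__ == "__main__":
    hidden = 768  # assumed hidden size, for illustration only
    W = np.random.default_rng(1).normal(size=(hidden, hidden))
    layer = LoRALinear(W, rank=8)
    print("trainable:", layer.trainable_params())
    print("frozen:", layer.frozen_params())
```

Under these assumed sizes, the layer keeps 589,824 frozen weights and adds 12,288 trainable ones (about 2% of the frozen count), which is the kind of saving that makes storing one small set of weights per dataset practical.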