We propose EmoDistill, a novel speech emotion recognition (SER) framework that leverages cross-modal knowledge distillation during training to learn strong linguistic and prosodic representations of emotion from speech. During inference, our method uses only a stream of speech signals to perform unimodal SER, thus reducing computational overhead and avoiding run-time transcription and prosodic feature extraction errors. During training, our method distills information at both the embedding and logit levels from a pair of pre-trained prosodic and linguistic teachers that are fine-tuned for SER. Experiments on the IEMOCAP benchmark demonstrate that our method outperforms other unimodal and multimodal techniques by a considerable margin and achieves state-of-the-art performance of 77.49% unweighted accuracy and 78.91% weighted accuracy. Detailed ablation studies demonstrate the impact of each component of our method.
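To make the two distillation levels concrete, the following is a minimal sketch of how a training objective that combines a supervised term with logit-level and embedding-level distillation from two teachers might look. The abstract does not specify the loss functions, so everything here is an assumption for illustration: the temperature-scaled KL divergence for logits, the cosine distance for embeddings, the weights `alpha` and `beta`, the four-class setup, and all function and key names are hypothetical rather than the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Supervised loss on the ground-truth emotion label."""
    return -math.log(softmax(student_logits)[label])


def logit_kd(student_logits, teacher_logits, temperature=2.0):
    """Logit-level distillation: KL(teacher || student) on softened outputs.

    The T^2 factor is the usual scaling from standard knowledge
    distillation; its use here is an assumption.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return temperature ** 2 * kl


def embedding_kd(student_emb, teacher_emb):
    """Embedding-level distillation as cosine distance (assumed form)."""
    dot = sum(a * b for a, b in zip(student_emb, teacher_emb))
    norm_s = math.sqrt(sum(a * a for a in student_emb))
    norm_t = math.sqrt(sum(b * b for b in teacher_emb))
    return 1.0 - dot / (norm_s * norm_t)


def distillation_loss(student, teachers, label, alpha=1.0, beta=1.0, temperature=2.0):
    """Combine the supervised loss with both distillation levels.

    `student` holds "logits" plus one embedding per teacher under
    "embeddings"; `teachers` maps a teacher name (for example "prosodic",
    "linguistic") to its "logits" and "embedding". This layout is
    hypothetical and chosen only to keep the sketch short.
    """
    loss = cross_entropy(student["logits"], label)
    for name, teacher in teachers.items():
        loss += alpha * logit_kd(student["logits"], teacher["logits"], temperature)
        loss += beta * embedding_kd(student["embeddings"][name], teacher["embedding"])
    return loss


# Toy example with four emotion classes and 3-dimensional embeddings.
teachers = {
    "prosodic": {"logits": [2.0, 0.5, 0.1, -1.0], "embedding": [0.2, 0.9, 0.1]},
    "linguistic": {"logits": [1.5, 1.0, -0.5, -1.0], "embedding": [0.7, 0.1, 0.6]},
}
student = {
    "logits": [1.0, 0.8, 0.0, -0.5],
    "embeddings": {"prosodic": [0.3, 0.8, 0.2], "linguistic": [0.6, 0.2, 0.5]},
}
total = distillation_loss(student, teachers, label=0)
```

In this sketch the teachers are consulted only when computing the loss. At inference time only the student's speech pathway would run, which is consistent with the unimodal inference described in the abstract.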