Speech Emotion Recognition (SER) has traditionally been formulated as a classification task. However, emotions lie on a spectrum whose distribution varies from situation to situation, leading to poor Out-of-Domain (OOD) performance. Taking inspiration from the statistical formulation of Automatic Speech Recognition (ASR), we formulate SER as generating the most likely sequence of text tokens from which emotion is inferred. This formulation decomposes SER into acoustic-model predictions weighted by language-model predictions. As an instance of this approach, we present SELM, an audio-conditioned language model for SER that predicts different emotion views. We train SELM on a curated speech emotion corpus and test it on three OOD datasets (RAVDESS, CREMA-D, IEMOCAP) not used in training. SELM achieves significant improvements over state-of-the-art baselines, with 17% and 7% relative accuracy gains on RAVDESS and CREMA-D, respectively. Moreover, SELM can further boost its performance through few-shot learning with a small number of annotated examples. These results highlight the effectiveness of our SER formulation, especially for improving performance in OOD scenarios.
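For intuition, a minimal sketch of the ASR-style decomposition alluded to above, where $X$ denotes the audio observation and $Y$ the sequence of text tokens encoding emotion (the symbols and exact factorization are illustrative assumptions, not necessarily the paper's precise notation):
\[
\hat{Y} \;=\; \arg\max_{Y} P(Y \mid X) \;=\; \arg\max_{Y} \underbrace{P(X \mid Y)}_{\text{acoustic model}} \;\underbrace{P(Y)}_{\text{language model}}
\]
Under this view, the language-model prior over emotion-bearing token sequences reweights the acoustic evidence, which is what allows a single audio-conditioned language model to produce different emotion views from the same input.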