Dawn of the transformer era in speech emotion recognition: closing the valence gap

Recent advances in transformer-based architectures that are pre-trained in a self-supervised manner have shown great promise in several machine learning tasks. In the audio domain, such architectures have also been successfully utilised for speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance, and have paid limited attention to generalisation, robustness, fairness, and efficiency. The present contribution conducts a thorough analysis of these aspects on several pre-trained variants of wav2vec 2.0 and HuBERT that we fine-tuned on the dimensions arousal, dominance, and valence of MSP-Podcast, while additionally using IEMOCAP and MOSI to test cross-corpus generalisation. To the best of our knowledge, we obtain the top performance for valence prediction without the use of explicit linguistic information, with a concordance correlation coefficient (CCC) of .638 on MSP-Podcast. Furthermore, our investigations reveal that transformer-based architectures are more robust to small perturbations than a CNN-based baseline and are fair with respect to biological sex groups, but not towards individual speakers. Finally, we are the first to show that their extraordinary success on valence is based on implicit linguistic information learnt during fine-tuning of the transformer layers, which explains why they perform on par with recent multimodal approaches that explicitly utilise textual information. Our findings collectively paint the following picture: transformer-based architectures constitute the new state of the art in SER, but further advances are needed to mitigate remaining robustness and individual speaker issues. To make our findings reproducible, we release the best performing model to the community.
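For readers unfamiliar with the metric reported above, the concordance correlation coefficient between predictions and gold-standard labels can be sketched as follows. This is a minimal illustration of the standard definition, CCC = 2 cov(p, g) / (var(p) + var(g) + (mean(p) - mean(g))^2). The function name `concordance_cc` is hypothetical, and the use of population (biased) variance is an assumption, since the abstract does not state how the score was computed.

```python
from statistics import fmean


def concordance_cc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences.

    Assumption: population (divide-by-n) variance and covariance are used;
    the abstract does not specify which estimator the authors applied.
    """
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("pred and gold must be non-empty and of equal length")

    n = len(pred)
    mean_p = fmean(pred)
    mean_g = fmean(gold)

    # Population variances and covariance.
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n

    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        # Both sequences are constant and identical: the ratio is undefined.
        raise ValueError("CCC is undefined for two identical constant sequences")

    return 2 * cov / denom
```

Unlike Pearson correlation, the CCC penalises a systematic offset: `concordance_cc([1, 2, 3], [1, 2, 3])` gives 1, whereas shifting the predictions by one unit, `concordance_cc([2, 3, 4], [1, 2, 3])`, lowers the score to 4/7 even though the two sequences remain perfectly correlated.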
