In this paper, we propose MMER, a novel Multimodal Multi-task learning approach for Speech Emotion Recognition. MMER uses a multimodal network based on early fusion and cross-modal self-attention between the text and acoustic modalities, and solves three novel auxiliary tasks to learn emotion recognition from spoken utterances. In practice, MMER outperforms all our baselines and achieves state-of-the-art performance on the IEMOCAP benchmark. We also conduct extensive ablation studies and results analysis to demonstrate the effectiveness of the proposed approach.