Multimodal speech emotion recognition aims to detect speakers' emotions from audio and text. Prior work has mainly focused on using advanced networks to model and fuse information from the different modalities in order to improve performance, while neglecting the effect of the fusion strategy itself on emotion recognition. In this work, we consider a simple yet important question: which way of fusing audio and text information is most helpful for this multimodal task? Further, we propose a multimodal emotion recognition model improved by a perspective loss. Empirical results show that our method achieves new state-of-the-art results on the IEMOCAP dataset. An in-depth analysis explains why the improved model outperforms the baselines.