Multimodal emotion recognition is a challenging research area that aims to fuse different modalities to predict human emotion. However, most existing attention-based models struggle to learn on their own which parts of the input are emotionally relevant. To address this problem, we propose incorporating external emotion-related knowledge into the co-attention-based fusion of pre-trained models. To incorporate this knowledge effectively, we enhance the co-attention model with a Bayesian attention module (BAM) whose prior distribution is estimated from the emotion-related knowledge. Experimental results on the IEMOCAP dataset show that the proposed approach outperforms several state-of-the-art approaches by at least 0.7% in unweighted accuracy (UA).
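The abstract does not specify how BAM is formulated. As a rough illustration only, the sketch below shows one way a knowledge-informed prior can enter a co-attention head. It assumes that the unnormalized attention weights are lognormal latent variables whose posterior mean matches the usual softmax weights, and that a prior lognormal has its mean set from a per-key "emotion relevance" score vector. A KL term then pulls the attention toward that prior. The function name, the lognormal choice, the `knowledge` scores, and the `sigma`/`prior_sigma` values are all hypothetical and are not the authors' implementation.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def knowledge_prior_bayesian_attention(q, k, v, knowledge, sigma=0.5,
                                       prior_sigma=1.0, rng=None):
    """Hypothetical lognormal Bayesian co-attention with a knowledge prior.

    q: (Tq, d) queries from one modality (e.g. text features).
    k, v: (Tk, d) keys/values from the other modality (e.g. audio features).
    knowledge: (Tk,) hypothetical emotion-relevance scores per key; how such
        scores are obtained is an assumption and is not shown here.
    Returns the attended values, the sampled attention, and the mean KL term.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # (Tq, Tk) scaled dot products

    # Posterior: log w ~ N(mu_post, sigma^2). Subtracting sigma^2 / 2 makes
    # E[w] equal to the ordinary softmax attention weights.
    mu_post = log_softmax(scores) - sigma**2 / 2

    # Prior: log w ~ N(mu_prior, prior_sigma^2), with E[w] equal to the
    # softmax of the (hypothetical) knowledge scores, shared across queries.
    mu_prior = np.broadcast_to(
        log_softmax(knowledge) - prior_sigma**2 / 2, scores.shape)

    # Reparameterized sample of the unnormalized weights, then renormalize.
    eps = rng.standard_normal(scores.shape)
    w = np.exp(mu_post + sigma * eps)
    attn = w / w.sum(axis=-1, keepdims=True)

    # KL between two lognormals equals KL between their underlying normals.
    kl = (np.log(prior_sigma / sigma)
          + (sigma**2 + (mu_post - mu_prior) ** 2) / (2 * prior_sigma**2)
          - 0.5)
    return attn @ v, attn, kl.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.standard_normal((4, 8))   # 4 text tokens, feature size 8
    audio = rng.standard_normal((6, 8))  # 6 audio frames, feature size 8
    relevance = rng.random(6)            # hypothetical knowledge scores
    out, attn, kl = knowledge_prior_bayesian_attention(
        text, audio, audio, relevance, rng=rng)
    print(out.shape, attn.shape, float(kl))
```

In training, the returned KL term would typically be added, with some weight, to the classification loss. A larger `prior_sigma` weakens the pull toward the knowledge prior, and a smaller one strengthens it.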