Multimodal emotion recognition aims to recognize the emotion of each utterance from multiple modalities, and it has received increasing attention for its applications in human-machine interaction. Current graph-based methods fail to simultaneously capture global contextual features and local, diverse uni-modal features in a dialogue. Furthermore, as the number of graph layers increases, they easily suffer from over-smoothing. In this paper, we propose a method for joint modality fusion and graph contrastive learning for multimodal emotion recognition (Joyful), in which multimodal fusion, contrastive learning, and emotion recognition are jointly optimized. Specifically, we first design a new multimodal fusion mechanism that provides deep interaction and fusion between the global contextual features and the uni-modal specific features. Then, we introduce a graph contrastive learning framework with inter-view and intra-view contrastive losses to learn more distinguishable representations for samples with different sentiments. Extensive experiments on three benchmark datasets indicate that Joyful achieves state-of-the-art (SOTA) performance compared with all baselines.
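To make the idea of inter-view and intra-view contrastive losses concrete, the following is a minimal sketch in Python with NumPy. The abstract does not give the exact loss, so this uses a common InfoNCE-style formulation from graph contrastive learning as an assumption: the same node in two augmented graph views forms the positive pair, other nodes in the other view act as inter-view negatives, and other nodes in the same view act as intra-view negatives. The function names, the temperature `tau`, and the symmetric averaging are illustrative choices, not Joyful's actual implementation.

```python
import numpy as np


def l2_normalize(z):
    """Scale each row to unit length so dot products become cosine similarities."""
    return z / np.linalg.norm(z, axis=1, keepdims=True)


def view_contrastive_loss(anchor, other, tau=0.5):
    """InfoNCE-style loss with inter-view and intra-view negatives (assumed form).

    anchor, other: arrays of shape (num_nodes, dim); row i in both views is the
    same utterance node, so (anchor[i], other[i]) is the positive pair.
    """
    anchor, other = l2_normalize(anchor), l2_normalize(other)
    inter = np.exp(anchor @ other.T / tau)   # similarities across the two views
    intra = np.exp(anchor @ anchor.T / tau)  # similarities inside the anchor view
    positives = np.diag(inter)
    # Denominator: positive pair + inter-view negatives + intra-view negatives.
    # The intra-view diagonal (a node compared with itself) is excluded.
    denom = inter.sum(axis=1) + intra.sum(axis=1) - np.diag(intra)
    return float(np.mean(-np.log(positives / denom)))


def symmetric_loss(z1, z2, tau=0.5):
    """Average the loss with each view taking the anchor role in turn."""
    return 0.5 * (view_contrastive_loss(z1, z2, tau) + view_contrastive_loss(z2, z1, tau))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    view1 = rng.normal(size=(8, 16))                  # toy node embeddings, view 1
    view2 = view1 + 0.01 * rng.normal(size=(8, 16))   # slightly perturbed view 2
    print("aligned views:   ", symmetric_loss(view1, view2))
    print("misaligned views:", symmetric_loss(view1, np.roll(view2, 1, axis=0)))
```

In this sketch the loss is lower when row i of both views refers to the same node than when the rows are misaligned, which is the behavior a contrastive objective is meant to reward. In a jointly optimized setup such as the one described above, a term like this would be added to the fusion and classification losses with some weighting; the weighting scheme is not specified in the abstract.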