Multimodal machine learning is an emerging research area that has received considerable scholarly attention in recent years. To date, however, few studies have addressed multimodal Emotion Recognition in Conversation (ERC). Because Graph Neural Networks (GNNs) are well suited to relational modeling, they have an inherent advantage in multimodal learning. Operating on a graph constructed from multimodal data, a GNN performs intra- and inter-modal information interaction, which facilitates the integration and complementation of multimodal data. In this work, we propose a novel Graph network based Multimodal Fusion Technique (GraphMFT) for ERC. Multimodal data are modeled as a graph in which each data object is a node and the intra- and inter-modal dependencies between data objects are edges. GraphMFT uses multiple improved graph attention networks to capture intra-modal contextual information and inter-modal complementary information. In addition, GraphMFT attempts to address the challenges faced by existing graph-based multimodal ERC models such as MMGCN. Empirical results on two public multimodal datasets show that our model outperforms State-Of-The-Art (SOTA) approaches, achieving accuracies of 67.90% and 61.30%.
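The graph formulation described in the abstract (data objects as nodes, intra- and inter-modal dependencies as edges) can be illustrated with a minimal sketch. It is not the GraphMFT construction itself, since the abstract does not state the edge rules. The three modalities, the `window` parameter, the function name `build_multimodal_graph`, and the rule that inter-modal edges link the same utterance across modalities are all assumptions made for illustration.

```python
from itertools import combinations

# Assumed modalities; the abstract does not list them.
MODALITIES = ("text", "audio", "visual")


def build_multimodal_graph(num_utterances, modalities=MODALITIES, window=2):
    """Return (nodes, intra_edges, inter_edges) for one conversation.

    Each node is an (utterance_index, modality) pair. Edges are undirected
    and stored as frozensets of two nodes. The edge rules are illustrative
    assumptions, not the rules used by GraphMFT.
    """
    nodes = [(i, m) for i in range(num_utterances) for m in modalities]

    # Intra-modal edges: same modality, utterances at most `window` turns apart.
    intra_edges = {
        frozenset({(i, m), (j, m)})
        for m in modalities
        for i in range(num_utterances)
        for j in range(i + 1, min(i + window, num_utterances - 1) + 1)
    }

    # Inter-modal edges: same utterance, two different modalities.
    inter_edges = {
        frozenset({(i, a), (i, b)})
        for i in range(num_utterances)
        for a, b in combinations(modalities, 2)
    }
    return nodes, intra_edges, inter_edges
```

Under these assumptions, a four-utterance conversation with the default settings yields 12 nodes, 15 intra-modal edges, and 12 inter-modal edges. Keeping the two edge sets separate would let a downstream attention layer treat contextual and cross-modal neighbors differently, which is the distinction the abstract draws.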