Multimodal machine learning is an emerging research area that has attracted considerable scholarly attention in recent years. To date, however, few studies have addressed multimodal Emotion Recognition in Conversation (ERC). Because Graph Neural Networks (GNNs) have a powerful capacity for relational modeling, they hold an inherent advantage in multimodal learning: operating on a graph constructed from multimodal data, they perform intra- and inter-modal information interaction, which facilitates the integration and complementation of multimodal data. In this work, we propose a novel Graph network based Multimodal Fusion Technique (GraphMFT) for emotion recognition in conversation. Multimodal data are modeled as a graph in which each data object is a node and the intra- and inter-modal dependencies between data objects are edges. GraphMFT uses multiple improved graph attention networks to capture intra-modal contextual information and inter-modal complementary information. In addition, GraphMFT attempts to address the challenges faced by existing graph-based multimodal conversational emotion recognition models such as MMGCN. Empirical results on two public multimodal datasets show that our model outperforms state-of-the-art (SOTA) approaches, achieving accuracies of 67.90% and 61.30%, respectively.
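The graph formulation described above (data objects as nodes, intra- and inter-modal dependencies as edges) can be illustrated with a minimal sketch. It is not the GraphMFT construction itself: the function name `build_multimodal_graph`, the modality names, the `window` parameter, and the two edge rules are all assumptions made for illustration. Here, same-modality nodes are linked when their utterances lie within a context window, and different-modality nodes are linked when they belong to the same utterance.

```python
from itertools import combinations


def build_multimodal_graph(num_utterances, modalities, window):
    """Build a toy multimodal conversation graph.

    Assumptions (illustrative, not taken from the paper):
    - each node is a (modality, utterance_index) pair;
    - intra-modal edges join same-modality nodes whose utterances
      are at most `window` positions apart;
    - inter-modal edges join the different-modality nodes of the
      same utterance.
    Edges are undirected and stored once as ordered pairs.
    """
    nodes = [(m, i) for m in modalities for i in range(num_utterances)]

    intra_edges = set()
    for m in modalities:
        for i in range(num_utterances):
            last = min(i + window, num_utterances - 1)
            for j in range(i + 1, last + 1):
                intra_edges.add(((m, i), (m, j)))

    inter_edges = set()
    for i in range(num_utterances):
        for m1, m2 in combinations(modalities, 2):
            inter_edges.add(((m1, i), (m2, i)))

    return nodes, intra_edges, inter_edges


# Hypothetical conversation: 4 utterances, 3 modalities, context window of 1.
nodes, intra_edges, inter_edges = build_multimodal_graph(
    4, ["text", "audio", "visual"], window=1
)
```

In a full model, each node would carry a feature vector for its modality, and separate attention-based graph layers could aggregate over the two edge sets to gather intra-modal context and inter-modal complementary information.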