Multimodal machine learning is an emerging research area that has received considerable scholarly attention in recent years. To date, however, few studies have addressed multimodal conversational emotion recognition. Because Graph Neural Networks (GNNs) are well suited to relational modeling, they have an inherent advantage in multimodal learning: a GNN operates on a graph constructed from multimodal data and performs both intra- and inter-modal information interaction, which facilitates the integration and complementation of the modalities. In this work, we propose a novel Graph network based Multimodal Fusion Technique (GraphMFT) for Emotion Recognition in Conversation (ERC). Multimodal data are modeled as a graph in which each data object is a node and the intra- and inter-modal dependencies between data objects are edges. GraphMFT uses multiple improved graph attention networks to capture intra-modal contextual information and inter-modal complementary information. In addition, GraphMFT attempts to address the challenges faced by existing graph-based multimodal ERC models such as MMGCN. Experiments on two public multimodal datasets show that our model outperforms state-of-the-art (SOTA) approaches, reaching accuracies of 67.90% and 61.30%, respectively.
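The graph modeling described above can be illustrated with a minimal sketch. It is not the authors' construction: the function name `build_multimodal_graph`, the modality names, the undirected edges, and the `window` parameter (how many following utterances each node connects to within its own modality) are all assumptions made for illustration. Each node is one utterance in one modality; intra-modal edges link nearby utterances of the same modality, and inter-modal edges link the same utterance across modalities.

```python
from itertools import combinations


def build_multimodal_graph(num_utterances,
                           modalities=("text", "audio", "visual"),
                           window=2):
    """Build node and edge lists for a conversation (illustrative only).

    Assumptions (not taken from the source): one node per (modality,
    utterance) pair, undirected edges stored once as (u, v) with u < v,
    and a fixed context window for intra-modal edges.
    """
    # One node per utterance per modality, grouped by modality.
    nodes = [(m, i) for m in modalities for i in range(num_utterances)]
    index = {node: k for k, node in enumerate(nodes)}

    # Intra-modal edges: same modality, utterances at most `window` apart.
    intra_edges = []
    for m in modalities:
        for i in range(num_utterances):
            last = min(i + window, num_utterances - 1)
            for j in range(i + 1, last + 1):
                intra_edges.append((index[(m, i)], index[(m, j)]))

    # Inter-modal edges: same utterance, every pair of distinct modalities.
    inter_edges = []
    for i in range(num_utterances):
        for m1, m2 in combinations(modalities, 2):
            inter_edges.append((index[(m1, i)], index[(m2, i)]))

    return nodes, intra_edges, inter_edges


if __name__ == "__main__":
    nodes, intra, inter = build_multimodal_graph(num_utterances=4, window=1)
    print(len(nodes), "nodes,", len(intra), "intra-modal edges,",
          len(inter), "inter-modal edges")
```

Keeping the two edge lists separate is a design choice for this sketch: it would let a model apply different attention layers to intra-modal context and inter-modal complementarity, which is the distinction the abstract draws.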