Emotion Recognition in Conversation (ERC) plays a significant part in Human-Computer Interaction (HCI) systems because it enables empathetic services. Multimodal ERC can mitigate the drawbacks of uni-modal approaches. Recently, Graph Neural Networks (GNNs) have been widely used in a variety of fields due to their superior performance in relation modeling. In multimodal ERC, GNNs can extract both long-distance contextual information and inter-modal interactive information. However, because existing methods such as MMGCN fuse multiple modalities directly, they may generate redundant information and lose diverse information. In this work, we present a directed Graph-based Cross-modal Feature Complementation (GraphCFC) module that can efficiently model contextual and interactive information. GraphCFC alleviates the heterogeneity gap problem in multimodal fusion by utilizing multiple subspace extractors and a Pair-wise Cross-modal Complementary (PairCC) strategy. We extract various types of edges from the constructed graph for encoding, which enables GNNs to extract crucial contextual and interactive information more accurately during message passing. Furthermore, we design a GNN structure called GAT-MLP, which provides a new unified network framework for multimodal learning. Experimental results on two benchmark datasets show that GraphCFC outperforms state-of-the-art (SOTA) approaches.
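To make the idea of extracting multiple edge types from a directed conversation graph more concrete, the following is a minimal sketch and not the GraphCFC implementation. It assumes one node per (utterance, modality) pair, a fixed context window, and a hypothetical edge-type naming scheme; the function name `build_typed_edges`, the `window` parameter, and the three modality labels are illustrative assumptions that do not come from the abstract.

```python
from itertools import permutations

# Assumed modality set; the abstract does not list the modalities used.
MODALITIES = ("text", "audio", "visual")


def build_typed_edges(num_utterances, window=2):
    """Return directed, typed edges as (src_node, dst_node, edge_type).

    A node is a tuple (utterance_index, modality). This layout is an
    illustrative assumption, not the paper's exact graph construction.
    """
    edges = []
    for i in range(num_utterances):
        # Inter-modal edges: link the modality nodes of the same utterance,
        # one directed edge type per ordered modality pair.
        for m_src, m_dst in permutations(MODALITIES, 2):
            edges.append(((i, m_src), (i, m_dst), f"{m_src}->{m_dst}"))

        # Intra-modal context edges: messages flow from a neighbouring
        # utterance j into utterance i, typed by temporal direction.
        lo = max(0, i - window)
        hi = min(num_utterances, i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            direction = "past" if j < i else "future"
            for m in MODALITIES:
                edges.append(((j, m), (i, m), f"{m}:{direction}"))
    return edges


edges = build_typed_edges(num_utterances=3, window=1)
edge_types = sorted({edge_type for _, _, edge_type in edges})
```

Keeping the edge type as an explicit label on each edge lets a downstream GNN layer learn a separate embedding or attention bias per type, which is one plausible way to separate contextual edges from inter-modal edges during message passing.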