Efficiently capturing consistent and complementary semantic features in a multimodal conversational context is crucial for Multimodal Emotion Recognition in Conversation (MERC). Existing methods mainly use graph structures to model semantic dependencies in the dialogue context and employ Graph Neural Networks (GNNs) to capture multimodal semantic features for emotion recognition. However, these methods are limited by inherent characteristics of GNNs, such as over-smoothing and low-pass filtering, and therefore cannot efficiently learn long-distance consistency and complementarity information. Since consistency and complementarity information correspond to low-frequency and high-frequency information, respectively, this paper revisits multimodal emotion recognition in conversation from a graph-spectral perspective. Specifically, we propose GS-MCC, a Graph-Spectrum-based Multimodal Consistency and Complementary collaborative learning framework. First, GS-MCC uses a sliding window to construct a multimodal interaction graph that models conversational relationships, and applies efficient Fourier graph operators to extract long-distance high-frequency and low-frequency information. Then, GS-MCC uses contrastive learning to construct self-supervised signals that promote collaboration between the complementary (high-frequency) and consistent (low-frequency) semantics, improving the ability of both frequency bands to reflect true emotions. Finally, GS-MCC feeds the collaborative high- and low-frequency information into an MLP network and a softmax function for emotion prediction. Extensive experiments on two benchmark datasets demonstrate the superiority of the proposed GS-MCC architecture.
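The core intuition above — consistency as low-frequency and complementarity as high-frequency graph signals over a sliding-window conversation graph — can be illustrated with a minimal NumPy sketch. This is not the paper's actual Fourier graph operators: the eigendecomposition-based filter, the `cutoff` threshold, and the helper names (`sliding_window_graph`, `spectral_split`) are illustrative assumptions, shown only to make the graph-spectral decomposition concrete.

```python
import numpy as np

def sliding_window_graph(n_utts, window=2):
    """Connect each utterance to neighbours within `window` past/future turns
    (an assumed, simplified version of the multimodal interaction graph)."""
    adj = np.zeros((n_utts, n_utts))
    for i in range(n_utts):
        for j in range(max(0, i - window), min(n_utts, i + window + 1)):
            if i != j:
                adj[i, j] = 1.0
    return adj

def spectral_split(adj, feats, cutoff=0.5):
    """Split node features into low- and high-frequency components via the
    normalized graph Laplacian spectrum. Low frequencies vary smoothly over
    the graph (consistency-like); high frequencies capture sharp differences
    between neighbours (complementarity-like)."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt  # normalized Laplacian
    evals, evecs = np.linalg.eigh(lap)        # eigenvectors = graph Fourier basis
    spec = evecs.T @ feats                    # graph Fourier transform of features
    low_mask = (evals <= cutoff)[:, None]
    low = evecs @ (spec * low_mask)           # low-pass filtered features
    high = evecs @ (spec * ~low_mask)         # high-pass filtered features
    return low, high
```

Because the two masks partition the spectrum, the low- and high-frequency parts sum exactly back to the original features; a real GNN's low-pass aggregation would keep only the first component, which is precisely the limitation the abstract points to.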