Emotion recognition is a crucial task for human conversation understanding. It becomes more challenging with the introduction of multimodal data, e.g., language, voice, and facial expressions. As a typical solution, global and local context information are exploited to predict the emotional label for every single sentence, i.e., utterance, in the dialogue. Specifically, the global representation can be captured by modeling cross-modal interactions at the conversation level. The local one is often inferred from the temporal information of speakers or from emotional shifts, which neglects vital factors at the utterance level. Additionally, most existing approaches take fused features of multiple modalities as a unified input without leveraging modality-specific representations. Motivated by these problems, we propose the Relational Temporal Graph Neural Network with Auxiliary Cross-Modality Interaction (CORECT), a novel neural network framework that effectively captures conversation-level cross-modality interactions and utterance-level temporal dependencies in a modality-specific manner for conversation understanding. Extensive experiments demonstrate the effectiveness of CORECT via its state-of-the-art results on the IEMOCAP and CMU-MOSEI datasets for the multimodal emotion recognition in conversation (ERC) task.