Emotion recognition is a crucial task for human conversation understanding. It becomes more challenging with multimodal data, e.g., language, voice, and facial expressions. As a typical solution, global and local context information is exploited to predict the emotional label for every single sentence, i.e., utterance, in the dialogue. Specifically, the global representation can be captured by modeling cross-modal interactions at the conversation level, while the local one is often inferred from the temporal information of speakers or emotional shifts, which neglects vital factors at the utterance level. Additionally, most existing approaches fuse the features of multiple modalities into a unified input without leveraging modality-specific representations. Motivated by these problems, we propose the Relational Temporal Graph Neural Network with Auxiliary Cross-Modality Interaction (CORECT), a novel neural network framework that effectively captures conversation-level cross-modality interactions and utterance-level temporal dependencies in a modality-specific manner for conversation understanding. Extensive experiments demonstrate the effectiveness of CORECT via its state-of-the-art results on the IEMOCAP and CMU-MOSEI datasets for the multimodal ERC task.
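To make the two ideas named above concrete, the sketch below illustrates (1) a relation-aware temporal graph over utterance nodes, built per modality with past/future temporal edges, and (2) a cross-modal attention step that lets one modality attend to another at the conversation level. This is a minimal illustration under our own assumptions, not the authors' implementation; all module and parameter names (e.g., `RelationalTemporalLayer`, `CrossModalAttention`, the past/future window sizes) are hypothetical.

```python
# Minimal sketch of modality-specific relational temporal graph encoding plus
# cross-modal attention. Hypothetical names; not the CORECT reference code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def temporal_edges(num_utts, past=2, future=2):
    """Edges (src, dst, relation) linking each utterance to temporal neighbours;
    relation 0 = past->current, relation 1 = future->current."""
    src, dst, rel = [], [], []
    for i in range(num_utts):
        for j in range(max(0, i - past), i):                    # past neighbours
            src.append(j); dst.append(i); rel.append(0)
        for j in range(i + 1, min(num_utts, i + 1 + future)):   # future neighbours
            src.append(j); dst.append(i); rel.append(1)
    return torch.tensor(src), torch.tensor(dst), torch.tensor(rel)


class RelationalTemporalLayer(nn.Module):
    """Relation-specific message passing: one weight matrix per relation type."""
    def __init__(self, dim, num_relations=2):
        super().__init__()
        self.rel_weights = nn.ModuleList(
            [nn.Linear(dim, dim, bias=False) for _ in range(num_relations)])
        self.self_loop = nn.Linear(dim, dim, bias=False)

    def forward(self, x, src, dst, rel):
        out = self.self_loop(x)
        for r, lin in enumerate(self.rel_weights):
            mask = rel == r
            if mask.any():
                msg = lin(x[src[mask]])                   # transform source nodes
                out = out.index_add(0, dst[mask], msg)    # aggregate into targets
        return F.relu(out)


class CrossModalAttention(nn.Module):
    """Query one modality against another (conversation-level interaction)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_mod, key_mod):
        out, _ = self.attn(query_mod, key_mod, key_mod)
        return out


# Toy usage: 5 utterances, 3 modalities (text/audio/visual), 16-dim features each.
if __name__ == "__main__":
    num_utts, dim = 5, 16
    feats = {m: torch.randn(num_utts, dim) for m in ("text", "audio", "visual")}
    src, dst, rel = temporal_edges(num_utts, past=2, future=2)

    gnn = RelationalTemporalLayer(dim)
    xattn = CrossModalAttention(dim)

    # Modality-specific temporal graph encoding (utterance-level dependencies).
    local = {m: gnn(x, src, dst, rel) for m, x in feats.items()}
    # Example cross-modal interaction: text queries audio.
    global_ta = xattn(local["text"].unsqueeze(0), local["audio"].unsqueeze(0))
    print(global_ta.shape)  # torch.Size([1, 5, 16])
```

In practice the per-modality node representations and the cross-modal outputs would be combined and passed to a classifier over emotion labels; the sketch only shows how temporal relations and cross-modality interaction can each be modeled per modality rather than on a single fused input.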