Emotion Recognition in Conversation (ERC) across modalities is of vital importance for a variety of applications, including intelligent healthcare, conversational artificial intelligence, and opinion mining over chat history. The crux of ERC is to model both cross-modality and cross-time interactions throughout the conversation. Previous methods have made progress in learning the temporal information of a conversation but lack the ability to track the different emotional states of each speaker. In this paper, we propose a recurrent structure called Speaker Information Enhanced Long Short-Term Memory (SI-LSTM) for the ERC task, in which the emotional states of each distinct speaker are tracked sequentially to enhance the learning of emotion in conversation. Further, to improve the learning of multimodal features in ERC, we use a cross-modal attention component to fuse features from different modalities and to model the interaction of important information across them. Experimental results on two benchmark datasets show that the proposed SI-LSTM outperforms state-of-the-art baseline methods on the multimodal ERC task.
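The abstract does not specify how the cross-modal attention component is built. As a rough illustration of the general idea only, the sketch below uses plain scaled dot-product attention in which one modality supplies the queries and another supplies the keys and values. The function names, the toy feature vectors, and the absence of learned projection matrices are all assumptions made for this example, not details of SI-LSTM.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Attend from one modality (queries) over another (keys/values).

    queries: list of d-dimensional vectors (e.g. text features)
    keys:    list of d-dimensional vectors from the other modality
    values:  list of vectors, same length as keys
    Returns one fused vector per query.

    Hypothetical helper: no learned projections are applied here.
    """
    d = len(queries[0])
    fused = []
    for q in queries:
        # Scaled dot-product similarity between this query and every key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the other modality's value vectors.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Toy features (made up for illustration): 2 text steps, 3 audio steps.
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_modal_attention(text, audio, audio)
```

Each fused vector is a convex combination of the audio vectors, weighted toward the audio steps most similar to the corresponding text step. A real model would typically add learned query, key, and value projections and apply the attention in both directions.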