Emotion Recognition in Conversation (ERC) across modalities is vital for a variety of applications, including intelligent healthcare, conversational artificial intelligence, and opinion mining over chat history. The crux of ERC is to model both cross-modality and cross-time interactions throughout the conversation. Previous methods have made progress in learning the temporal information of a conversation, but they lack the ability to track the distinct emotional states of each speaker. In this paper, we propose a recurrent structure called Speaker Information Enhanced Long-Short Term Memory (SI-LSTM) for the ERC task, in which the emotional state of each speaker is tracked sequentially to enhance the learning of emotion in conversation. Further, to improve the learning of multimodal features in ERC, we use a cross-modal attention component to fuse features across modalities and to model the interaction of important information from different modalities. Experimental results on two benchmark datasets demonstrate that the proposed SI-LSTM outperforms state-of-the-art baseline methods on the ERC task with multimodal data.
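The two ideas named in the abstract, cross-modal attention and per-speaker state tracking, can be illustrated with a minimal sketch. It is not the SI-LSTM architecture itself: the abstract does not give the equations, so the attention below is plain scaled dot-product attention without learned projections, and the speaker tracker replaces the LSTM gates with a simple exponential moving average. The function names `cross_modal_attention` and `track_speaker_states`, the `decay` parameter, and the toy feature vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention where queries come from one modality
    (e.g. text) and keys/values from another (e.g. audio).
    Assumption: learned projection matrices are omitted for brevity."""
    d = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key in the other modality.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the other modality's value vectors.
        fused.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return fused


def track_speaker_states(utterances, speakers, decay=0.5):
    """Hypothetical stand-in for a speaker-aware recurrent update.
    Each speaker keeps a separate running state that is updated only on
    that speaker's own turns. An exponential moving average is used here
    instead of LSTM gates (assumption, not the paper's formulation)."""
    states = {}
    outputs = []
    for feat, spk in zip(utterances, speakers):
        prev = states.get(spk, [0.0] * len(feat))
        new = [decay * p + (1.0 - decay) * x for p, x in zip(prev, feat)]
        states[spk] = new
        outputs.append(new)
    return outputs, states


# Toy usage: two text utterances attend over three audio frames,
# then the fused features update the state of speakers "A" and "B".
text_feats = [[1.0, 0.0], [0.0, 1.0]]
audio_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused_feats = cross_modal_attention(text_feats, audio_feats, audio_feats)
per_turn, final_states = track_speaker_states(fused_feats, ["A", "B"])
```

Keeping one state per speaker in a dictionary is the simplest way to make the update depend on who is speaking. A trained model would replace the moving average with gated recurrent updates and would learn the attention projections.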