Emotion recognition in conversations is challenging because emotion is expressed across multiple modalities. We propose a hierarchical cross-attention model (HCAM) for multi-modal emotion recognition that combines recurrent and co-attention neural network models. The input to the model consists of two modalities: (i) audio data, processed through a learnable wav2vec approach, and (ii) text data, represented using a bidirectional encoder representations from transformers (BERT) model. The audio and text representations are processed by a set of bi-directional recurrent neural network layers with self-attention that convert each utterance in a given conversation into a fixed-dimensional embedding. To incorporate contextual knowledge and information across the two modalities, the audio and text embeddings are combined using a co-attention layer that attempts to weigh the utterance-level embeddings according to their relevance to the emotion recognition task. The neural network parameters in the audio layers, the text layers, and the multi-modal co-attention layers are trained hierarchically for the emotion classification task. We perform experiments on three established datasets, namely IEMOCAP, MELD, and CMU-MOSI, where we show that the proposed model improves significantly over other benchmarks and achieves state-of-the-art results on all three datasets.
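The two attention stages described above can be illustrated with a minimal, parameter-free sketch. It is not the HCAM implementation: the abstract describes trained bi-directional recurrent layers and a trained co-attention layer, whereas this sketch assumes plain scaled dot-product scores with no learned weights. All function names (`attention_pool`, `cross_attend`, `fuse`) and the toy vectors are hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_sum(weights, vectors):
    """Combine equal-length vectors using the given weights."""
    dim = len(vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(dim)]

def attention_pool(frames, query):
    """Stage 1 (assumed form): pool a variable-length sequence of frame
    or token vectors into one fixed-dimensional utterance embedding."""
    weights = softmax([dot(f, query) for f in frames])
    return weighted_sum(weights, frames)

def cross_attend(queries, keys_values):
    """Stage 2 (assumed form): every utterance of one modality attends
    over all utterances of the other modality in the conversation."""
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        attended.append(weighted_sum(weights, keys_values))
    return attended

def fuse(audio_utts, text_utts):
    """Concatenate the two cross-attended views for each utterance."""
    audio_ctx = cross_attend(audio_utts, text_utts)
    text_ctx = cross_attend(text_utts, audio_utts)
    return [a + t for a, t in zip(audio_ctx, text_ctx)]  # list concatenation

# Toy conversation: 3 utterances, 2-dimensional embeddings per modality.
audio = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
text = [[0.3, 0.7], [0.9, 0.1], [0.4, 0.6]]
fused = fuse(audio, text)  # 3 utterances, each a 4-dimensional vector
```

In a trained model, each fused utterance vector would feed an emotion classifier, and the scoring functions would carry learned projection matrices rather than raw dot products.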