Accurately detecting emotions in conversation is a necessary yet challenging task because emotions are complex and dialogues are dynamic. A speaker's emotional state can be influenced by many factors, such as interlocutor stimulus, dialogue scene, and topic. In this work, we propose a conversational speech emotion recognition method that captures attentive contextual dependencies and speaker-sensitive interactions. First, we use a pretrained VGGish model to extract segment-based audio representations from individual utterances. Second, an attentive bi-directional gated recurrent unit (GRU) models context-sensitive information and jointly explores intra- and inter-speaker dependencies in a dynamic manner. Experiments on the standard conversational dataset MELD demonstrate the effectiveness of the proposed method compared with state-of-the-art methods.
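To make the idea of attentive contextual dependency concrete, the following is a minimal sketch of dot-product attention over the utterance vectors of a dialogue. It is an illustration under stated assumptions, not the proposed model: the recurrent (GRU) layer and the speaker-specific modelling are omitted, the function names `softmax` and `attend` are hypothetical, and the toy vectors are 3-dimensional stand-ins for real utterance representations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, context):
    """Dot-product attention (an assumed scoring function, for illustration).

    Scores each context vector by its dot product with the query,
    normalises the scores with softmax, and returns the weights together
    with the weighted sum of the context vectors.
    """
    scores = [sum(q * c for q, c in zip(query, vec)) for vec in context]
    weights = softmax(scores)
    dim = len(query)
    summary = [
        sum(w * vec[i] for w, vec in zip(weights, context)) for i in range(dim)
    ]
    return weights, summary


# Toy utterance vectors standing in for pooled audio representations.
context = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query = [1.0, 0.0, 0.0]  # representation of the current utterance
weights, summary = attend(query, context)
```

In this toy dialogue, the context utterances most similar to the current one receive the largest weights, so the summary vector is dominated by them. A learned scoring function and hidden states from a recurrent layer would replace the raw dot product and raw vectors in a trained system.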