Owing to recent developments in Generative Artificial Intelligence (GenAI) and Large Language Models (LLMs), conversational agents are becoming increasingly popular and widely accepted. They provide a human touch by interacting in ways familiar to us and by offering support as virtual companions. It is therefore important for such agents to understand the user's emotions in order to respond considerately. Compared with the standard emotion recognition problem, conversational agents face an additional constraint: recognition must happen in real time. Studies of model architectures that use audio, visual, and textual modalities have mainly focused on emotion classification over full video sequences, an approach that does not support online recognition. In this work, we present a novel paradigm for contextualized Emotion Recognition using a Graph Convolutional Network with Reinforcement Learning (conER-GRL). Conversations are partitioned into smaller groups of utterances for effective extraction of contextual information. The system uses Gated Recurrent Units (GRUs) to extract multimodal features from these groups of utterances. More importantly, Graph Convolutional Networks (GCNs) and Reinforcement Learning (RL) agents are trained in cascade to capture the complex dependencies of emotion features in interactive scenarios. A comparison of conER-GRL with other state-of-the-art models on the IEMOCAP benchmark dataset demonstrates the advantages of the conER-GRL architecture in recognizing emotions in real time from multimodal conversational signals.
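The partitioning step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the group size, whether groups overlap, or how utterances are represented, so the fixed, non-overlapping grouping below, the function name `partition_utterances`, and the parameter `group_size` are all assumptions made for illustration and not the authors' implementation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def partition_utterances(utterances: Sequence[T], group_size: int) -> List[List[T]]:
    """Split a conversation into consecutive, non-overlapping groups.

    Hypothetical helper: the grouping rule (fixed size, no overlap) is an
    assumption, since the abstract does not specify one.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    # Step through the conversation in strides of group_size; the final
    # group may be shorter when the length is not an exact multiple.
    return [
        list(utterances[start:start + group_size])
        for start in range(0, len(utterances), group_size)
    ]


# Placeholder utterance IDs stand in for real multimodal utterance features.
conversation = ["u1", "u2", "u3", "u4", "u5"]
print(partition_utterances(conversation, group_size=2))
# → [['u1', 'u2'], ['u3', 'u4'], ['u5']]
```

In a full pipeline each group would presumably be passed to the GRU feature extractor; that stage, along with the GCN and RL components, is outside the scope of this sketch.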