Emotions play an essential role in human communication. Computer vision models that automatically recognize emotional expression can support a variety of domains, including robotics, digital behavioral healthcare, and media analytics. Affective computing research has traditionally modeled three types of emotional representation: Action Units, Valence-Arousal (VA), and Categorical Emotions. As part of an effort to move beyond these representations towards more fine-grained labels, we describe our submission to the newly introduced Emotional Reaction Intensity (ERI) Estimation challenge in the 5th competition for Affective Behavior Analysis in-the-Wild (ABAW). We developed four deep neural networks trained in the visual domain and a multimodal model trained on both visual and audio features to predict emotional reaction intensity. Our best-performing model, which uses a pre-trained ResNet50, achieved an average Pearson correlation coefficient of 0.4080 on the test set of the Hume-Reaction dataset. This work is a first step towards production-grade models that predict emotional reaction intensities rather than discrete emotion categories.
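To make the reported metric concrete, the following is a minimal sketch of an average Pearson correlation coefficient. It assumes the average is taken over emotion dimensions, with one correlation computed per emotion across all samples. The function names `pearson` and `mean_pearson` and the row-per-sample layout are illustrative assumptions, not the challenge's official evaluation code.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Unnormalized covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def mean_pearson(
    preds: Sequence[Sequence[float]], labels: Sequence[Sequence[float]]
) -> float:
    """Average the per-emotion Pearson correlation.

    Assumed layout (hypothetical): one row per sample, one column per emotion.
    """
    n_emotions = len(labels[0])
    scores = [
        pearson([row[j] for row in preds], [row[j] for row in labels])
        for j in range(n_emotions)
    ]
    return sum(scores) / n_emotions


if __name__ == "__main__":
    # Toy intensities for three samples and two emotions (made-up values).
    labels = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
    preds = [[0.1, 0.9], [0.4, 0.6], [0.9, 0.1]]
    print(f"mean Pearson: {mean_pearson(preds, labels):.4f}")
```

Because each emotion is scored separately before averaging, a model that ranks samples well for only some emotions gets a proportionally lower score.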