Automatic emotion recognition has recently gained significant attention, driven by the growing popularity of deep learning algorithms. One of the primary challenges in emotion recognition is making effective use of the various cues (modalities) available in the data. Another is explaining the outcome of the learned model. To address these challenges, we present Explainable Multimodal Emotion Recognition with Situational Knowledge (EMERSK), a general, modular system for human emotion recognition and explanation using visual information. The system handles multiple modalities, including facial expressions, posture, and gait, and its network consists of modules that can be added or removed depending on the available data. We use a two-stream network architecture with convolutional neural networks (CNNs) and encoder-decoder style attention mechanisms to extract deep features from face images. Similarly, CNNs and recurrent neural networks (RNNs) with long short-term memory (LSTM) units extract features from posture and gait data. We also incorporate deep features from the background as contextual information for the learning process. The deep features from each module are combined by an early fusion network. To generate explanations, we leverage situational knowledge derived from the location type and the adjective-noun pair (ANP) extracted from the scene, as well as the spatio-temporal average distribution of emotions. Ablation studies demonstrate that each sub-network can perform emotion recognition independently, and that combining them in a multimodal approach significantly improves overall recognition performance. Extensive experiments on several benchmark datasets, including GroupWalk, show that our approach outperforms other state-of-the-art methods.
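The modular early-fusion idea described above can be illustrated with a minimal sketch. This is not the EMERSK implementation: the feature dimensions, the function names (`early_fusion`, `classify`), the zero-filling of absent modules, and the linear classification head are all assumptions made for illustration. The abstract states only that per-module deep features are fused early and that modules can be added or removed depending on the available data.

```python
import math
from typing import Dict, List, Optional, Sequence

# Hypothetical per-module feature sizes; the abstract gives no real dimensions.
FEATURE_DIMS: Dict[str, int] = {"face": 4, "posture": 3, "gait": 3, "background": 2}


def early_fusion(features: Dict[str, Optional[Sequence[float]]]) -> List[float]:
    """Concatenate per-module feature vectors in a fixed module order.

    Modules that are absent (or None) are zero-filled so the fused vector
    keeps a constant length. Zero-filling is an assumption of this sketch,
    not something the abstract specifies.
    """
    fused: List[float] = []
    for name, dim in FEATURE_DIMS.items():
        vec = features.get(name)
        if vec is None:
            fused.extend([0.0] * dim)
            continue
        if len(vec) != dim:
            raise ValueError(f"{name}: expected {dim} values, got {len(vec)}")
        fused.extend(float(v) for v in vec)
    return fused


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused: Sequence[float],
             weights: Sequence[Sequence[float]],
             bias: Sequence[float]) -> List[float]:
    """Hypothetical linear head: one weight row per emotion class."""
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Usage: only face and gait features are available for this sample.
sample = {"face": [0.1, 0.2, 0.3, 0.4], "gait": [1.0, 2.0, 3.0]}
fused_vector = early_fusion(sample)
```

Because fusion happens on features rather than on per-module predictions, a single downstream classifier sees all modalities at once. A late-fusion design would instead combine the outputs of separate per-module classifiers.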