Emotion recognition plays a crucial role in many domains of human-robot interaction. In long-term interactions with humans, robots need to respond continuously and accurately; however, mainstream emotion recognition methods mostly focus on short-term recognition and disregard the context in which emotions are perceived. Humans take contextual information into account, and different contexts can lead to completely different emotional expressions. In this paper, we introduce the self context-aware model (SCAM), which employs a two-dimensional emotion coordinate system for anchoring and re-labeling distinct emotions. It also incorporates a distinctive information retention structure and a contextual loss. This approach improves accuracy in the audio, visual, and multimodal settings. In the audio modality, accuracy rises from 63.10% to 72.46% (a gain of 9.36 percentage points). In the visual modality, accuracy increases from 77.03% to 80.82% (3.79 points). In the multimodal setting, accuracy rises from 77.48% to 78.93% (1.45 points). In future work, we will validate the reliability and usability of SCAM on robots through psychology experiments.