Accurate emotion recognition is pivotal for nuanced and engaging human-computer interactions, yet it remains difficult to achieve, especially in dynamic, conversation-like settings. In this study, we show how integrating eye-tracking data, temporal dynamics, and personality traits can substantially enhance the detection of both perceived and felt emotions. Seventy-three participants viewed short, speech-containing videos from the CREMA-D dataset while their eye-tracking signals (pupil size, fixation patterns) were recorded; we additionally collected Big Five personality assessments and self-reported emotional states. Our neural network models combined these diverse inputs, including stimulus emotion labels as contextual cues, and yielded marked performance gains over the state of the art. Specifically, perceived valence predictions reached a macro F1-score of 0.76, and models incorporating personality traits and stimulus information showed significant improvements in felt emotion accuracy. These results highlight the benefit of unifying physiological, individual, and contextual factors to address the subjectivity and complexity of emotional expression. Beyond validating the role of user-specific data in capturing subtle internal states, our findings inform the design of future affective computing and human-agent systems, paving the way for more adaptive and cross-individual emotional intelligence in real-world interactions.
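To make the described fusion concrete, the following is a minimal sketch (not the authors' implementation) of a multimodal network of the kind outlined above: a recurrent encoder summarizes per-frame eye-tracking features, which are then concatenated with Big Five scores and a one-hot stimulus emotion label before a small classifier predicts perceived valence. All layer sizes, the GRU encoder, the specific input dimensions, and the three-class valence target are illustrative assumptions.

```python
# Hypothetical multimodal fusion model; dimensions and architecture are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalValenceNet(nn.Module):
    def __init__(self, gaze_dim=4, personality_dim=5, stimulus_dim=6,
                 hidden_dim=64, num_classes=3):
        super().__init__()
        # Temporal encoder for per-frame eye-tracking features
        # (e.g. pupil size, fixation coordinates, fixation duration).
        self.gaze_encoder = nn.GRU(gaze_dim, hidden_dim, batch_first=True)
        # Fusion MLP over [gaze summary | Big Five traits | stimulus emotion label].
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim + personality_dim + stimulus_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, gaze_seq, personality, stimulus_label):
        # gaze_seq: (batch, time, gaze_dim); use the final GRU state as a trial summary.
        _, h_n = self.gaze_encoder(gaze_seq)
        fused = torch.cat([h_n.squeeze(0), personality, stimulus_label], dim=-1)
        return self.classifier(fused)  # logits over valence classes

# Example forward pass with random inputs (8 trials, 120 gaze frames each).
model = MultimodalValenceNet()
logits = model(torch.randn(8, 120, 4),
               torch.randn(8, 5),
               F.one_hot(torch.randint(0, 6, (8,)), 6).float())
```

Reported metrics such as the macro F1-score would then be computed over the predicted valence classes on held-out participants or trials.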