Human emotion recognition plays an important role in human-computer interaction. In this paper, we present our approach to the Valence-Arousal (VA) Estimation Challenge, the Expression (Expr) Classification Challenge, and the Action Unit (AU) Detection Challenge of the 5th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW). Specifically, we propose a novel multi-modal fusion model that combines Temporal Convolutional Networks (TCN) with a Transformer to improve continuous emotion recognition. The model is designed to integrate visual and audio information effectively and thereby recognize emotions more accurately. Our model outperforms the baseline and ranks third in the Expression Classification Challenge.