In this paper, we present our solution to the Emotional Mimicry Intensity (EMI) Estimation challenge, which is part of the 6th Affective Behavior Analysis in-the-wild (ABAW) Competition. The EMI Estimation challenge aims to evaluate the emotional intensity of seed videos by rating them on a set of predefined emotion categories (i.e., "Admiration", "Amusement", "Determination", "Empathic Pain", "Excitement" and "Joy"). To tackle this challenge, we extracted dual-channel visual features based on ResNet18 and AUs for the video modality and single-channel features based on Wav2Vec2.0 for the audio modality, which together provide emotional features covering both the visual and the acoustic signal. We then applied a late fusion strategy, averaging the predictions of the visual and acoustic models to obtain a more accurate estimate of audiovisual emotional mimicry intensity. Experimental results validate the effectiveness of our approach: the average Pearson's correlation coefficient ($\rho$) across the 6 emotion dimensions on the validation set reaches 0.3288.
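The late fusion step and the evaluation metric described above can be illustrated with a minimal sketch. It is not the implementation used for the reported result: the function names `late_fusion` and `mean_pearson`, the array layout (one row per video, one column per emotion), and the equal weighting of the two models are assumptions made for illustration only.

```python
import numpy as np

# The six emotion categories named in the challenge description.
EMOTIONS = ["Admiration", "Amusement", "Determination",
            "Empathic Pain", "Excitement", "Joy"]


def late_fusion(visual_pred, audio_pred):
    """Average the predictions of the visual and acoustic models.

    Both inputs are assumed to have shape (num_videos, num_emotions).
    """
    visual_pred = np.asarray(visual_pred, dtype=float)
    audio_pred = np.asarray(audio_pred, dtype=float)
    if visual_pred.shape != audio_pred.shape:
        raise ValueError("visual and audio predictions must share a shape")
    return (visual_pred + audio_pred) / 2.0


def mean_pearson(pred, target):
    """Pearson's correlation per emotion column, averaged over columns.

    A column with zero variance yields NaN; this sketch does not handle it.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    pred_c = pred - pred.mean(axis=0)        # center each emotion column
    target_c = target - target.mean(axis=0)
    numerator = (pred_c * target_c).sum(axis=0)
    denominator = np.sqrt((pred_c ** 2).sum(axis=0) * (target_c ** 2).sum(axis=0))
    return float((numerator / denominator).mean())


if __name__ == "__main__":
    # Synthetic data only; these are not results from the challenge.
    rng = np.random.default_rng(0)
    labels = rng.random((32, len(EMOTIONS)))
    visual = labels + 0.3 * rng.standard_normal(labels.shape)
    audio = labels + 0.3 * rng.standard_normal(labels.shape)
    fused = late_fusion(visual, audio)
    print("visual:", mean_pearson(visual, labels))
    print("audio: ", mean_pearson(audio, labels))
    print("fused: ", mean_pearson(fused, labels))
```

Averaging is the simplest form of late fusion; a weighted average would be a natural variation, but the abstract only states that the two sets of predictions were averaged.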