In this work, we introduce a novel methodology for assessing Emotional Mimicry Intensity (EMI) as part of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild. Our methodology utilises the Wav2Vec 2.0 architecture, pre-trained on an extensive podcast dataset, to capture a wide array of audio features comprising both linguistic and paralinguistic components. We refine the feature extraction process with a fusion technique that combines each individual feature vector with a global mean vector, embedding broader contextual information into the analysis. A key aspect of our approach is a multi-task fusion strategy that not only leverages these features but also incorporates a pre-trained Valence-Arousal-Dominance (VAD) model, refining emotion intensity prediction by processing multiple emotional dimensions concurrently. For the temporal analysis of the audio data, the feature fusion employs a Long Short-Term Memory (LSTM) network. Relying solely on the provided audio data, this approach markedly improves on the existing baseline, offering a more comprehensive understanding of emotional mimicry in naturalistic settings and achieving second place in the EMI challenge.
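The abstract leaves the implementation unspecified, so the following is only a minimal PyTorch sketch of the described pipeline under stated assumptions: the fusion operator is taken to be concatenation of each frame feature with the utterance-level mean, six emotion intensity targets and a three-dimensional VAD head are assumed, and the class name `EMIFusionModel`, the layer sizes, and the last-hidden-state pooling are illustrative choices rather than the authors' configuration.

```python
# A minimal sketch, assuming concatenation fusion, six EMI targets,
# and illustrative layer sizes; not the authors' exact architecture.
import torch
import torch.nn as nn

class EMIFusionModel(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=256, n_emotions=6, n_vad=3):
        super().__init__()
        # LSTM over frame features fused with the utterance-level mean vector
        self.lstm = nn.LSTM(feat_dim * 2, hidden_dim, batch_first=True)
        # Multi-task heads: emotion intensities and Valence-Arousal-Dominance
        self.emi_head = nn.Linear(hidden_dim, n_emotions)
        self.vad_head = nn.Linear(hidden_dim, n_vad)

    def forward(self, frames):
        # frames: (batch, time, feat_dim) Wav2Vec 2.0 frame features
        global_mean = frames.mean(dim=1, keepdim=True)            # (B, 1, D)
        fused = torch.cat([frames, global_mean.expand_as(frames)], dim=-1)
        out, _ = self.lstm(fused)                                  # (B, T, H)
        pooled = out[:, -1]                                        # last hidden state
        # Sigmoid bounds intensities to [0, 1] (an assumption); VAD left unbounded
        return torch.sigmoid(self.emi_head(pooled)), self.vad_head(pooled)

model = EMIFusionModel()
feats = torch.randn(2, 200, 1024)  # e.g. two utterances, 200 Wav2Vec 2.0 frames
emi, vad = model(feats)            # (2, 6) intensities, (2, 3) VAD estimates
```

Concatenating the global mean with every frame is one simple way to give each timestep access to utterance-level context before the temporal model; the paper's actual fusion operator may differ.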