We propose a method for the Emotional Mimicry Intensity (EMI) Estimation task of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild. Our approach uses the Wav2Vec 2.0 framework, pre-trained on a large podcast dataset, to extract a broad range of audio features covering both linguistic and paralinguistic information. We enrich the feature representation with a fusion technique that combines the individual features with a global mean vector, which adds global context to each feature. We also incorporate a pre-trained valence-arousal-dominance (VAD) module from the Wav2Vec 2.0 model. The fusion uses a Long Short-Term Memory (LSTM) architecture for efficient temporal modeling of the audio data. Using only the provided audio data, our approach shows significant improvements over the established baseline.
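The abstract does not say how the individual features and the global mean vector are combined. As a minimal sketch, assuming plain concatenation along the feature dimension, the fusion step could look like the following. The function name `fuse_with_global_mean` and the toy two-dimensional frames are hypothetical; in the described pipeline the inputs would be Wav2Vec 2.0 frame embeddings, and the fused sequence would then be passed to the LSTM.

```python
from typing import List


def fuse_with_global_mean(frames: List[List[float]]) -> List[List[float]]:
    """Append the utterance-level mean vector to every frame-level feature vector.

    Assumption: fusion is done by concatenation; the source does not specify this.
    """
    if not frames:
        return []
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all frames must have the same dimensionality")
    n = len(frames)
    # Global mean vector: average of each feature dimension over all frames.
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Each fused frame keeps its local features and gains the shared global context.
    return [list(f) + mean for f in frames]


# Toy stand-in for a sequence of three 2-dimensional frame embeddings.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fused = fuse_with_global_mean(frames)
print(fused[0])  # [1.0, 2.0, 3.0, 4.0]
```

Concatenation doubles the feature dimension, so a downstream recurrent layer would need an input size of twice the embedding width under this assumption.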