In this study, we propose a method for the Emotional Mimicry Intensity (EMI) Estimation task of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW). Our approach uses the Wav2Vec 2.0 framework, pre-trained on a comprehensive podcast dataset, to extract audio features that cover both linguistic and paralinguistic information. We enrich the feature representation with a fusion technique that combines each individual feature with a global mean vector, which adds global context to the analysis. We also incorporate the pre-trained valence-arousal-dominance (VAD) module of the Wav2Vec 2.0 model. Our fusion employs a Long Short-Term Memory (LSTM) architecture for temporal modeling of the audio data. Using only the provided audio data, our approach improves significantly over the established baseline.
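The fusion of individual features with a global mean vector can be illustrated with a minimal sketch. The abstract does not state the fusion operator, so concatenation is an assumption here, and the function name `fuse_with_global_mean` and the toy 2-dimensional frames are hypothetical stand-ins for real Wav2Vec 2.0 outputs.

```python
from typing import List, Sequence


def fuse_with_global_mean(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Append the utterance-level mean vector to every frame-level feature vector.

    Concatenation is an assumed fusion operator; the abstract only says the
    individual features are integrated with a global mean vector.
    """
    if not frames:
        return []
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frames must share one feature dimension")
    n_frames = len(frames)
    # Global mean vector: average of each feature dimension over all time steps.
    global_mean = [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]
    # Every time step keeps its local features and gains the same global context.
    return [list(frame) + global_mean for frame in frames]


# Toy input: 3 time steps with 2-dimensional features (hypothetical values).
toy_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fused = fuse_with_global_mean(toy_frames)
```

Under this assumption, each fused vector has twice the original dimension, and the resulting sequence is what a temporal model such as an LSTM would consume.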