Multimodal fusion is a core technique in most multimodal tasks. With the recent surge in large pre-trained models, combining multimodal fusion methods with features from pre-trained models can achieve outstanding performance on many of these tasks. In this paper, we present an approach that leverages both advantages for Expression (Expr) Recognition and Valence-Arousal (VA) Estimation. We run pre-trained models on the Aff-Wild2 database and extract their final hidden layers as features. After preprocessing, we align the extracted features by interpolation or convolution and then apply different models for modal fusion. Our code is available on GitHub at FulgenceWen/ABAW6th.
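The alignment step can be illustrated with a minimal sketch. It assumes each modality yields a sequence of per-step feature vectors at its own frame rate, resamples every sequence to a common length by linear interpolation, and then concatenates the aligned vectors per time step. The names `resample_linear` and `fuse_concat` are hypothetical, plain concatenation is only a stand-in for the fusion models mentioned above, and a practical pipeline would use a tensor library rather than Python lists. This is not the code from the repository.

```python
from typing import List

FeatureSeq = List[List[float]]  # indexed as [time step][feature dimension]


def resample_linear(seq: FeatureSeq, target_len: int) -> FeatureSeq:
    """Linearly interpolate a feature sequence along time to target_len steps."""
    if not seq or target_len <= 0:
        raise ValueError("need a non-empty sequence and a positive target length")
    src_len = len(seq)
    # Degenerate cases: nothing to interpolate between, so repeat the first frame.
    if src_len == 1 or target_len == 1:
        return [list(seq[0]) for _ in range(target_len)]
    scale = (src_len - 1) / (target_len - 1)
    out: FeatureSeq = []
    for t in range(target_len):
        pos = t * scale                  # fractional position in the source sequence
        lo = min(int(pos), src_len - 2)  # left neighbour, clamped so lo + 1 is valid
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(seq[lo], seq[lo + 1])])
    return out


def fuse_concat(modalities: List[FeatureSeq], target_len: int) -> FeatureSeq:
    """Align every modality to target_len steps, then concatenate per time step."""
    aligned = [resample_linear(m, target_len) for m in modalities]
    return [[v for m in aligned for v in m[t]] for t in range(target_len)]


# Toy features (made-up values): audio has 3 steps, visual has 5 steps.
audio = [[0.0], [2.0], [4.0]]
visual = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0], [9.0, 9.0]]
fused = fuse_concat([audio, visual], target_len=5)
```

Here `fused` has 5 time steps with 3 values each (1 audio dimension plus 2 visual dimensions). A learned 1D convolution along time would be the alternative alignment route named in the abstract; it is not shown in this sketch.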