We use two multimodal models for continuous valence-arousal recognition from visual, audio, and linguistic information. The first model is the one we used in ABAW2 and ABAW3, which employs leader-follower attention. The second model has the same architecture for spatial and temporal encoding, but its fusion block employs a compact and straightforward channel attention borrowed from the End2You toolkit. Unlike our previous attempts, which used VGGish features directly as the audio features, this time we feed log-mel spectrograms to the pre-trained VGG model and fine-tune it during training. To make full use of the data and alleviate over-fitting, we carry out cross-validation. The fold with the highest concordance correlation coefficient (CCC) is selected for submission. The code will be made available at https://github.com/sucv/ABAW5.
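The fold selection criterion mentioned above can be illustrated with a minimal sketch. It assumes the standard definition of the concordance correlation coefficient, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), computed with population statistics. The function names `ccc` and `select_best_fold`, the fold labels, and the scores are hypothetical and are not taken from the authors' repository.

```python
from statistics import fmean


def ccc(pred, gold):
    """Concordance correlation coefficient of two equal-length sequences."""
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("pred and gold must be non-empty and of equal length")
    mean_p, mean_g = fmean(pred), fmean(gold)
    # Population (biased) variance and covariance.
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    # Assumption: return 0.0 when both sequences are constant and identical,
    # where the ratio is otherwise undefined.
    return 0.0 if denom == 0 else 2.0 * cov / denom


def select_best_fold(fold_scores):
    """Return the key of the fold with the highest CCC."""
    return max(fold_scores, key=fold_scores.get)


if __name__ == "__main__":
    # Hypothetical per-fold predictions against a shared toy label track.
    labels = [0.1, 0.4, 0.3, 0.8]
    predictions = {
        "fold_0": [0.2, 0.3, 0.3, 0.6],
        "fold_1": [0.1, 0.4, 0.2, 0.8],
    }
    scores = {name: ccc(p, labels) for name, p in predictions.items()}
    print(scores, select_best_fold(scores))
```

Unlike Pearson correlation, the CCC penalizes a constant offset between predictions and labels through the squared mean-difference term in the denominator, so a prediction track that is perfectly correlated with the labels but shifted scores below 1.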