Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary relationships over multiple modalities (e.g., audio, visual, biosignals, etc.), and can provide some robustness to noisy modalities. Most state-of-the-art methods for audio-visual (A-V) fusion rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. Specifically, we propose a joint cross-attention model that relies on complementary relationships to extract salient features across A-V modalities, allowing for accurate prediction of continuous values of valence and arousal. The proposed fusion model efficiently leverages the inter-modal relationships, while reducing the heterogeneity between the features. In particular, it computes the cross-attention weights based on the correlation between the combined feature representation and the individual modalities. By feeding the combined A-V feature representation into the cross-attention module, the performance of our fusion module improves significantly over the vanilla cross-attention module. Experimental results on validation-set videos from the AffWild2 dataset indicate that our proposed A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches. The code is available on GitHub: https://github.com/praveena2j/JointCrossAttentional-AV-Fusion.
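To make the core idea concrete, below is a minimal NumPy sketch of joint cross-attention: each modality's features are correlated with the *concatenated* (joint) A-V representation, rather than with the other modality alone, and the resulting weights re-weight that modality's features. The weight matrices (`Wa`, `Wv`), the tanh/softmax choices, the scaling, and the residual connection are illustrative assumptions for this sketch, not the exact parameterization of the paper's model (which learns these weights during training).

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 4, 6  # toy feature dimension and number of time steps

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_cross_attention(X, J, W):
    """Attend modality features X (d, L) to the joint A-V features J (2d, L).

    W (d, 2d) is an illustrative stand-in for a learned projection.
    Returns re-weighted features of the same shape as X.
    """
    # Cross-correlation between the modality and the joint representation
    C = np.tanh(X.T @ W @ J / np.sqrt(J.shape[0]))  # (L, L)
    A = softmax(C, axis=-1)                         # attention weights
    return X + X @ A                                # attended features + residual

# Toy per-modality feature sequences (audio and visual)
Xa = rng.standard_normal((d, L))
Xv = rng.standard_normal((d, L))

# Joint representation: concatenation along the feature axis
J = np.concatenate([Xa, Xv], axis=0)  # (2d, L)

# Hypothetical projection weights (learned in the real model)
Wa = 0.1 * rng.standard_normal((d, 2 * d))
Wv = 0.1 * rng.standard_normal((d, 2 * d))

Xa_att = joint_cross_attention(Xa, J, Wa)  # audio attended via joint features
Xv_att = joint_cross_attention(Xv, J, Wv)  # visual attended via joint features
```

In this sketch the attended audio and visual features `Xa_att` and `Xv_att` would then be concatenated and passed to the valence/arousal regression head; routing attention through the joint representation `J` is what distinguishes this scheme from vanilla cross-attention, where each modality attends directly to the other.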