Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary relationships over multiple modalities (e.g., audio, visual, biosignals, etc.), and can provide some robustness to noisy modalities. Most state-of-the-art methods for audio-visual (A-V) fusion rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. Specifically, we propose a joint cross-attention model that relies on complementary relationships to extract salient features across A-V modalities, allowing for accurate prediction of continuous values of valence and arousal. The proposed fusion model efficiently leverages the inter-modal relationships, while reducing the heterogeneity between the features. In particular, it computes the cross-attention weights based on the correlation between the combined feature representation and the individual modalities. By feeding the combined A-V feature representation into the cross-attention module, the performance of our fusion module improves significantly over the vanilla cross-attention module. Experimental results on validation-set videos from the AffWild2 dataset indicate that our proposed A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches. The code is available on GitHub: https://github.com/praveena2j/JointCrossAttentional-AV-Fusion.
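To make the core idea concrete, below is a minimal NumPy sketch of joint cross-attention: each modality's features are correlated with the *concatenated* (joint) A-V representation, rather than with the other modality alone, and the resulting weights re-weight that modality's features. The weight matrices (`Wa`, `Wv`), the tanh/softmax choices, the scaling, and the residual connection are illustrative assumptions for this sketch, not the exact parameterization of the paper's model (which learns these weights during training).

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 4, 6  # toy feature dimension and number of time steps

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_cross_attention(X, J, W):
    """Attend modality features X (d, L) to the joint A-V features J (2d, L).

    W (d, 2d) is an illustrative stand-in for a learned projection.
    Returns re-weighted features of the same shape as X.
    """
    # Cross-correlation between the modality and the joint representation
    C = np.tanh(X.T @ W @ J / np.sqrt(J.shape[0]))  # (L, L)
    A = softmax(C, axis=-1)                         # attention weights
    return X + X @ A                                # attended features + residual

# Toy per-modality feature sequences (audio and visual)
Xa = rng.standard_normal((d, L))
Xv = rng.standard_normal((d, L))

# Joint representation: concatenation along the feature axis
J = np.concatenate([Xa, Xv], axis=0)  # (2d, L)

# Hypothetical projection weights (learned in the real model)
Wa = 0.1 * rng.standard_normal((d, 2 * d))
Wv = 0.1 * rng.standard_normal((d, 2 * d))

Xa_att = joint_cross_attention(Xa, J, Wa)  # audio attended via joint features
Xv_att = joint_cross_attention(Xv, J, Wv)  # visual attended via joint features
```

In this sketch the attended audio and visual features `Xa_att` and `Xv_att` would then be concatenated and passed to the valence/arousal regression head; routing attention through the joint representation `J` is what distinguishes this scheme from vanilla cross-attention, where each modality attends directly to the other.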