Recursive Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition

Multimodal emotion recognition has recently gained considerable attention because it can leverage diverse and complementary relationships across multiple modalities, such as audio, visual, and text. Most state-of-the-art methods for multimodal fusion rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of the modalities. In this paper, we focus on dimensional emotion recognition based on the fusion of facial, vocal, and text modalities extracted from videos. Specifically, we propose recursive cross-modal attention (RCMA) to capture the complementary relationships across the modalities in a recursive fashion. The proposed model captures inter-modal relationships by computing cross-attention weights between each individual modality and the joint representation of the other two modalities. To further refine the inter-modal relationships, the attended features of the individual modalities are fed back as input to the cross-modal attention, which refines the feature representation of each modality. In addition, we use temporal convolutional networks (TCNs) to model the temporal dynamics (intra-modal relationships) of the individual modalities. By deploying TCNs together with cross-modal attention in a recursive fashion, we capture both intra- and inter-modal relationships across the audio, visual, and text modalities. Experimental results on validation-set videos from the AffWild2 dataset indicate that the proposed fusion model achieves a significant improvement over the baseline for the sixth Affective Behavior Analysis in-the-Wild (ABAW6) competition, held in 2024.
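The recursion described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain scaled dot-product attention with a residual connection, forms the joint representation of the other two modalities by feature-wise concatenation, and uses random matrices in place of learned weights. The names `cross_attend`, `rcma`, and `proj` are hypothetical, and the TCN stage is omitted.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(x, joint, w):
    """Attend from one modality x (T x d) to a joint representation (T x 2d).

    w (2d x d) is a hypothetical projection of the joint features into
    the feature space of x; in a real model it would be learned.
    """
    d = len(x[0])
    kv = matmul(joint, w)              # T x d keys/values
    scores = matmul(x, transpose(kv))  # T x T cross-modal scores
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, kv)     # T x d
    # Residual connection (an assumption) keeps the original features.
    return [[xi + ai for xi, ai in zip(xr, ar)] for xr, ar in zip(x, attended)]

def rcma(feats, proj, iterations=2):
    """Recursively refine each modality against the other two."""
    names = sorted(feats)
    for _ in range(iterations):
        updated = {}
        for name in names:
            others = [feats[n] for n in names if n != name]
            # Joint representation: concatenate the two other modalities per time step.
            joint = [ra + rb for ra, rb in zip(*others)]
            updated[name] = cross_attend(feats[name], joint, proj[name])
        feats = updated  # attended features become the next iteration's input
    return feats

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
T, d = 4, 3  # toy sequence length and feature size
feats = {m: rand_matrix(T, d) for m in ("audio", "visual", "text")}
proj = {m: rand_matrix(2 * d, d) for m in feats}
out = rcma(feats, proj, iterations=2)
```

Each pass updates all three modalities from the previous pass's features, so `iterations` controls how many times the attended features are fed back into the cross-modal attention.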
