Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition

Multimodal analysis has recently drawn much interest in affective computing, since it can improve the overall accuracy of emotion recognition over isolated uni-modal approaches. The most effective techniques for multimodal emotion recognition leverage diverse and complementary sources of information, such as facial, vocal, and physiological modalities, to provide comprehensive feature representations. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos, where complex spatiotemporal relationships may be captured. Most existing fusion techniques rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of audio-visual (A-V) modalities. We introduce a cross-attentional fusion approach that extracts the salient features across A-V modalities and exploits their inter-modal relationships, allowing for accurate prediction of continuous values of valence and arousal. In particular, the model computes cross-attention weights to focus on the more contributive features across the individual modalities, and thereby combines contributive feature representations, which are then fed to fully connected layers for the prediction of valence and arousal. The effectiveness of the proposed approach is validated experimentally on videos from the RECOLA and Fatigue (private) datasets. Results indicate that our cross-attentional A-V fusion model is a cost-effective approach that outperforms state-of-the-art fusion approaches. Code is available at \url{https://github.com/praveena2j/Cross-Attentional-AV-Fusion}
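
To make the fusion step described above more concrete, the following NumPy sketch shows one plausible way to compute cross-attention weights between audio and visual feature sequences, combine the attended features, and map them to valence and arousal. It is an illustration under stated assumptions, not the authors' implementation (see the repository linked above for that): the function name `cross_attention_fusion`, the feature dimensions, the randomly initialised weights, the single linear output layer, and the `tanh` squashing to [-1, 1] are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fusion(x_a, x_v, w_cross, w_out, b_out):
    """Illustrative cross-attentional A-V fusion (assumed design, not the paper's code).

    x_a:     (T, d_a) audio features, one row per time step
    x_v:     (T, d_v) visual features, one row per time step
    w_cross: (d_a, d_v) weights relating audio and visual feature spaces
    w_out:   (d_a + d_v, 2) weights of a single fully connected output layer
    b_out:   (2,) bias of the output layer
    """
    # Cross-correlation between every audio step and every visual step: (T, T).
    z = x_a @ w_cross @ x_v.T

    # Attention of each audio step over visual steps, and of each visual step
    # over audio steps. Each row sums to 1.
    attn_a2v = softmax(z, axis=1)
    attn_v2a = softmax(z.T, axis=1)

    # Attended features: visual context per audio step, audio context per visual step.
    att_v = attn_a2v @ x_v  # (T, d_v)
    att_a = attn_v2a @ x_a  # (T, d_a)

    # Combine the attended representations and predict (valence, arousal).
    fused = np.concatenate([att_a, att_v], axis=1)  # (T, d_a + d_v)
    pred = np.tanh(fused @ w_out + b_out)           # (T, 2), values in [-1, 1]
    return pred, attn_a2v, attn_v2a


# Usage with random stand-in features (all sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
T, d_a, d_v = 8, 16, 32
x_a = rng.standard_normal((T, d_a))
x_v = rng.standard_normal((T, d_v))
w_cross = rng.standard_normal((d_a, d_v)) * 0.1
w_out = rng.standard_normal((d_a + d_v, 2)) * 0.1
b_out = np.zeros(2)

pred, attn_a2v, attn_v2a = cross_attention_fusion(x_a, x_v, w_cross, w_out, b_out)
print(pred.shape)  # one (valence, arousal) pair per time step
```

In a trained model the weights would be learned jointly with the audio and visual backbones; here they are random, so the predictions carry no meaning beyond demonstrating the data flow and tensor shapes.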
