Handling Weak Complementary Relationships for Audio-Visual Emotion Recognition

Multimodal emotion recognition has recently attracted considerable interest in affective computing because it has the potential to outperform isolated unimodal approaches. Audio and visual modalities are the two predominant contact-free channels in videos and are often expected to be complementary. However, the audio and visual channels are not always complementary, which can yield poor audio-visual feature representations and degrade system performance. In this paper, we propose a flexible audio-visual fusion model that adapts to weak complementary relationships using a gated attention mechanism. Specifically, we extend the recursive joint cross-attention model by introducing a gating mechanism in every iteration to control the flow of information between the input features and the attended features, depending on the strength of their complementary relationship. For instance, if the modalities exhibit a strong complementary relationship, the gating mechanism chooses the cross-attended features; otherwise, it chooses the non-attended features. To further improve performance, we introduce a stage gating mechanism that controls the flow of information across the gated outputs of each iteration. By adding this flexibility to the recursive joint cross-attention mechanism, the proposed model improves performance even when the audio and visual modalities do not have a strong complementary relationship. The proposed model has been evaluated on the challenging Affwild2 dataset and significantly outperforms state-of-the-art fusion approaches.
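The gating idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`cross_attend`, `gate`, `gated_recursive_fusion`), the scoring vectors, and the feature shapes are all assumptions made for illustration, and plain scaled dot-product cross-attention stands in for the joint cross-attention step. The sketch only shows the two levels of soft selection: a per-iteration gate between input and attended features, and a stage gate across the gated outputs of all iterations.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, context):
    """Scaled dot-product cross-attention (a simplified stand-in for the
    joint cross-attention step; an assumption, not the paper's formula)."""
    d = query.shape[-1]
    attn = softmax(query @ context.T / np.sqrt(d), axis=-1)  # (T, T)
    return attn @ context                                     # (T, d)

def gate(candidates, w):
    """Softly select among K candidate feature sets.

    candidates: list of K arrays, each of shape (T, d)
    w: hypothetical scoring vector of shape (d, 1)
    Returns the fused features (T, d) and the gate weights (T, K).
    """
    stacked = np.stack(candidates, axis=1)               # (T, K, d)
    scores = (stacked @ w).squeeze(-1)                   # (T, K)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    fused = (weights[..., None] * stacked).sum(axis=1)   # (T, d)
    return fused, weights

def gated_recursive_fusion(xa, xv, w_a, w_v, w_stage_a, w_stage_v, n_iter=2):
    outs_a, outs_v = [], []
    for _ in range(n_iter):
        # Attend each modality to the other using the current features.
        att_a = cross_attend(xa, xv)
        att_v = cross_attend(xv, xa)
        # Iteration gate: weigh non-attended (input) against attended features.
        xa, _ = gate([xa, att_a], w_a)
        xv, _ = gate([xv, att_v], w_v)
        outs_a.append(xa)
        outs_v.append(xv)
    # Stage gate: weigh the gated outputs of all iterations.
    fused_a, stage_w_a = gate(outs_a, w_stage_a)
    fused_v, stage_w_v = gate(outs_v, w_stage_v)
    return np.concatenate([fused_a, fused_v], axis=-1), stage_w_a, stage_w_v

# Usage with random features: T time steps, d-dimensional audio/visual features.
rng = np.random.default_rng(0)
T, d = 8, 16
xa, xv = rng.normal(size=(T, d)), rng.normal(size=(T, d))
w = [rng.normal(size=(d, 1)) for _ in range(4)]  # random stand-ins for learned weights
fused, sw_a, sw_v = gated_recursive_fusion(xa, xv, *w, n_iter=3)
print(fused.shape)  # (8, 32)
```

Because each gate is a softmax over its candidates, the output is always a convex combination of them. When the scores strongly favor the non-attended features, the cross-attended features contribute almost nothing, which corresponds to the weak-complementarity case described above.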

