Leveraging complementary relationships across modalities has recently attracted considerable attention in multimodal emotion recognition. Most existing approaches explore cross-attention to capture the complementary relationships across modalities. However, the modalities may also exhibit weak complementary relationships, which can deteriorate the cross-attended features and result in poor multimodal feature representations. To address this problem, we propose Inconsistency-Aware Cross-Attention (IACA), which adaptively selects the most relevant features on-the-fly based on the strong or weak complementary relationships across the audio and visual modalities. Specifically, we design a two-stage gating mechanism that adaptively selects the appropriate relevant features to deal with weak complementary relationships. Extensive experiments are conducted on the challenging Aff-Wild2 dataset to demonstrate the robustness of the proposed model.
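The abstract does not specify the architecture, so the following is only a minimal PyTorch sketch of the general idea: each modality cross-attends to the other, a first gating stage softly chooses between the cross-attended and the original features (a fallback for weak complementary relationships), and a second stage weighs the two refined modalities before fusion. The module name `GatedCrossAttention`, all dimensions, and the exact gating design are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of a two-stage gated cross-attention, inspired by the
# abstract; the gating design and dimensions are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedCrossAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention: each modality queries the other.
        self.attn_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stage 1: per-modality gates deciding how much to trust the
        # cross-attended features versus the original ones.
        self.gate_a = nn.Linear(2 * dim, 1)
        self.gate_v = nn.Linear(2 * dim, 1)
        # Stage 2: gate weighing the two refined modalities before fusion.
        self.gate_fuse = nn.Linear(2 * dim, 2)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio: (batch, T_a, dim); visual: (batch, T_v, dim)
        att_a, _ = self.attn_a(audio, visual, visual)  # audio attends to visual
        att_v, _ = self.attn_v(visual, audio, audio)   # visual attends to audio
        # Stage 1: under weak complementarity the gate can fall back to the
        # original features instead of the (possibly degraded) attended ones.
        g_a = torch.sigmoid(self.gate_a(torch.cat([audio, att_a], dim=-1)))
        g_v = torch.sigmoid(self.gate_v(torch.cat([visual, att_v], dim=-1)))
        ref_a = g_a * att_a + (1 - g_a) * audio
        ref_v = g_v * att_v + (1 - g_v) * visual
        # Stage 2: pool over time and softly weigh the two modalities.
        pool_a, pool_v = ref_a.mean(dim=1), ref_v.mean(dim=1)
        w = F.softmax(self.gate_fuse(torch.cat([pool_a, pool_v], dim=-1)), dim=-1)
        return w[:, :1] * pool_a + w[:, 1:] * pool_v  # joint A-V representation


# Usage on random features standing in for audio/visual backbone outputs:
# fused = GatedCrossAttention()(torch.randn(4, 50, 512), torch.randn(4, 40, 512))
```

The key design point the sketch tries to capture is that gating happens before fusion: when cross-attention degrades a modality's features, the first-stage gate can route the original features through instead, so weak complementary relationships do not contaminate the joint representation.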