Leveraging complementary relationships across modalities has recently drawn considerable attention in multimodal emotion recognition. Most existing approaches use cross-attention to capture these relationships. However, the modalities may also exhibit weak complementary relationships, which can degrade the cross-attended features and result in poor multimodal feature representations. To address this problem, we propose Inconsistency-Aware Cross-Attention (IACA), which adaptively selects the most relevant features on the fly, depending on whether the complementary relationships across the audio and visual modalities are strong or weak. Specifically, we design a two-stage gating mechanism that adaptively selects the appropriate features to handle weak complementary relationships. Extensive experiments on the challenging Aff-Wild2 dataset demonstrate the robustness of the proposed model.
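To make the gating idea concrete, the following is a minimal sketch in plain Python, not the model's actual implementation. The abstract does not specify what each gating stage compares, so the sketch assumes that stage 1 soft-selects between unattended and cross-attended features, and that stage 2 soft-selects between the unattended features and the stage-1 output. The function names, the scoring vectors (`w_unattended`, `w_attended`, `w_stage1`), and the temperature value are all hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attend(queries, keys):
    """Scaled dot-product attention: each query frame attends over the
    frames of the other modality (keys double as values here)."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in keys])
        attended.append(
            [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)]
        )
    return attended


def gate(features_a, features_b, w_a, w_b, temperature=0.1):
    """Soft-select per frame between two candidate feature vectors.
    w_a and w_b are hypothetical (normally learned) scoring vectors."""
    mixed, gates = [], []
    for a, b in zip(features_a, features_b):
        g = softmax([dot(w_a, a), dot(w_b, b)], temperature)
        mixed.append([g[0] * x + g[1] * y for x, y in zip(a, b)])
        gates.append(g)
    return mixed, gates


def two_stage_gating(unattended, attended, w_unattended, w_attended, w_stage1):
    """Assumed two-stage scheme (not confirmed by the abstract):
    stage 1 mixes unattended vs. cross-attended features,
    stage 2 mixes unattended features vs. the stage-1 output."""
    stage1, g1 = gate(unattended, attended, w_unattended, w_attended)
    stage2, g2 = gate(unattended, stage1, w_unattended, w_stage1)
    return stage2, g1, g2


# Toy inputs: two frames per modality, two feature dimensions each.
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[0.5, 0.5], [1.0, -1.0]]

# Audio frames attend over visual frames, then pass through the gates.
audio_attended = cross_attend(audio, visual)
fused_audio, stage1_gates, stage2_gates = two_stage_gating(
    audio, audio_attended,
    w_unattended=[0.3, 0.3], w_attended=[0.6, 0.1], w_stage1=[0.2, 0.4],
)
```

With a low temperature the gates approach a hard choice, so a frame whose cross-attended features score poorly falls back to its unattended features. That is the behaviour the abstract describes for weak complementary relationships. The same procedure would apply symmetrically to the visual branch.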