Fusing multiple modalities has proven effective for multimodal information processing. However, incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition. In this study, we first analyze how the salient affective information in one modality can be affected by another, and demonstrate that inter-modal incongruity exists latently in crossmodal attention. Based on this finding, we propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model that dynamically chooses the primary modality in each training batch and reduces the number of fusion operations by leveraging the learned hierarchy in the latent space to alleviate incongruity. We evaluate the approach on five benchmark datasets: CMU-MOSI, CMU-MOSEI, and IEMOCAP (sentiment and emotion), where incongruity lies implicitly in hard samples, and UR-FUNNY (humour) and MUStARD (sarcasm), where incongruity is common. The results verify the efficacy of our approach, showing that HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates incongruity at the latent level in crossmodal attention.
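As a rough illustration of the two ideas named in the abstract, dynamic selection of a primary modality and hierarchical crossmodal fusion, the following minimal sketch may help. It is not the HCT-DMG implementation: the abstract does not specify the layer layout, so the two-stage order (auxiliary modalities fused first, then attended to by the primary modality), the function names `choose_primary`, `crossmodal_attention`, and `hierarchical_fuse`, and the fixed gate logits are all assumptions made for illustration. Learned query/key/value projections, multiple heads, and training are omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def crossmodal_attention(primary, auxiliary):
    """Scaled dot-product attention without learned projections (a simplification).

    Queries come from `primary`; keys and values come from `auxiliary`.
    Both are lists of feature vectors of the same width d.
    Returns one d-dimensional vector per primary time step.
    """
    d = len(primary[0])
    out = []
    for q in primary:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in auxiliary]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, auxiliary))
                    for i in range(d)])
    return out


def choose_primary(gate_logits):
    """Pick the modality with the largest gate weight.

    `gate_logits` maps modality name -> score. In a trained model these would
    be learnable parameters; here they are plain numbers (assumption).
    """
    names = list(gate_logits)
    weights = softmax([gate_logits[n] for n in names])
    best = max(range(len(names)), key=lambda i: weights[i])
    return names[best], dict(zip(names, weights))


def hierarchical_fuse(features, gate_logits):
    """Two-stage fusion for exactly three modalities (hypothetical layout).

    Stage 1: one auxiliary modality attends to the other.
    Stage 2: the primary modality attends to the stage-1 result.
    """
    primary, weights = choose_primary(gate_logits)
    aux = [name for name in features if name != primary]
    aux_fused = crossmodal_attention(features[aux[0]], features[aux[1]])
    fused = crossmodal_attention(features[primary], aux_fused)
    return primary, weights, fused


if __name__ == "__main__":
    # Toy features: sequences of 4-dimensional vectors with different lengths.
    feats = {
        "text": [[0.1 * (i + j) for j in range(4)] for i in range(3)],
        "audio": [[0.2 * (i - j) for j in range(4)] for i in range(5)],
        "vision": [[1.0, 0.0, -1.0, 0.5], [0.0, 1.0, 0.5, -1.0]],
    }
    primary, weights, fused = hierarchical_fuse(
        feats, {"text": 2.0, "audio": 0.5, "vision": -1.0}
    )
    print(primary, len(fused), len(fused[0]))
```

In this layout only two crossmodal attention steps are needed for three modalities, rather than one for every ordered pair, which is one way to read the abstract's claim of fewer fusion operations.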