Fusing multiple modalities has proven effective for improving performance on affective computing tasks. However, how multimodal fusion works is not well understood, and deploying it in the real world usually results in large models. In this work, focusing on sentiment and emotion analysis, we first analyze how the salient affective information in one modality is affected by another modality under crossmodal attention. We find that crossmodal attention introduces inter-modal incongruity at the latent level. Based on this finding, we propose a lightweight model, the Hierarchical Crossmodal Transformer with Modality Gating (HCT-MG), which determines a primary modality according to its contribution to the target task and then hierarchically incorporates the auxiliary modalities to alleviate inter-modal incongruity and reduce information redundancy. Experimental evaluation on three benchmark datasets (CMU-MOSI, CMU-MOSEI, and IEMOCAP) verifies the efficacy of our approach, showing that it 1) achieves better performance than prior work as well as manual selection of the primary modality; 2) can recognize hard samples whose emotions are difficult to distinguish; 3) mitigates inter-modal incongruity at the latent level when modalities have mismatched affective tendencies; and 4) reduces the model size to fewer than 1M parameters while outperforming existing models of similar size.
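To make the idea of a gated primary modality with hierarchical crossmodal attention more concrete, the following is a minimal, dependency-free sketch. It is not the HCT-MG implementation: the abstract does not specify the gating function, the fusion order, or the layer design, so everything here is an assumption made for illustration. In particular, the function names (`crossmodal_attention`, `hierarchical_fusion`), the use of one scalar gate score per modality, the residual addition, and the choice to fold in auxiliary modalities one at a time in order of gate weight are all hypothetical. The sketch also assumes every modality has already been projected to a common feature dimension.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def crossmodal_attention(target, source):
    """Scaled dot-product attention with queries from `target` and
    keys/values from `source` (no learned projections in this sketch).

    target: list of T vectors, source: list of S vectors, same dimension.
    Returns T vectors, each a weighted average of the source vectors.
    """
    dim = len(target[0])
    scale = math.sqrt(dim)
    out = []
    for q in target:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in source]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, source))
                    for d in range(dim)])
    return out


def add_residual(x, y):
    """Element-wise sum of two equally shaped sequences of vectors."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]


def hierarchical_fusion(features, gate_scores):
    """Pick a primary modality by gate weight, then fold in the others.

    features:    dict name -> sequence of vectors (shared dimension assumed)
    gate_scores: dict name -> float; stands in for learnable gate parameters
                 (hypothetical, a trained model would learn these)
    """
    names = list(features)
    weights = softmax([gate_scores[n] for n in names])
    ranked = sorted(zip(names, weights), key=lambda p: p[1], reverse=True)
    order = [name for name, _ in ranked]
    primary = order[0]
    fused = features[primary]
    # Assumed hierarchy: the primary stream queries one auxiliary at a time.
    for aux in order[1:]:
        fused = add_residual(fused, crossmodal_attention(fused, features[aux]))
    return primary, fused


# Toy inputs: three modalities with different sequence lengths, dimension 2.
features = {
    "text": [[1.0, 0.0], [0.0, 1.0]],
    "audio": [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]],
    "vision": [[0.2, 0.8]],
}
gate_scores = {"text": 2.0, "audio": 0.5, "vision": -1.0}
primary, fused = hierarchical_fusion(features, gate_scores)
print(primary, len(fused), len(fused[0]))
```

In this sketch the fused sequence keeps the length of the primary modality, because only the primary stream supplies queries. This is one simple way to let a single modality anchor the representation while the others contribute context. A trained model would add learned projections, multiple heads, and normalization on top of it.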