Fusing multiple modalities has proven effective for improving performance on affective computing tasks. However, how multimodal fusion works is not well understood, and its real-world use usually results in large models. In this work, focusing on sentiment and emotion analysis, we first analyze how the salient affective information in one modality can be affected by another modality through crossmodal attention. We find that crossmodal attention introduces inter-modal incongruity at the latent level. Based on this finding, we propose a lightweight model, the Hierarchical Crossmodal Transformer with Modality Gating (HCT-MG), which determines a primary modality according to its contribution to the target task and then hierarchically incorporates auxiliary modalities to alleviate inter-modal incongruity and reduce information redundancy. Experimental evaluation on three benchmark datasets (CMU-MOSI, CMU-MOSEI, and IEMOCAP) verifies the efficacy of our approach, showing that it: 1) outperforms or is competitive with major prior work and successfully recognizes hard samples; 2) mitigates inter-modal incongruity at the latent level when modalities have mismatched affective tendencies; 3) reduces model size to fewer than 1M parameters while outperforming existing models of similar size.
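The two ideas named in the abstract, choosing a primary modality by a gate and then folding in the auxiliary modalities hierarchically, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The gate is assumed to be a softmax over one scalar logit per modality, the crossmodal attention is plain single-head scaled dot-product attention without learned projections, and all names (`select_primary`, `crossmodal_attention`, `hierarchical_fusion`, the `text`/`audio`/`vision` keys) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def crossmodal_attention(target, source):
    """Queries come from `target`; keys and values come from `source`.

    target: (T_target, d), source: (T_source, d) -> output: (T_target, d)
    """
    d = target.shape[-1]
    scores = target @ source.T / np.sqrt(d)      # (T_target, T_source)
    return softmax(scores, axis=-1) @ source     # (T_target, d)


def select_primary(gate_logits, names):
    """Assumed gate: softmax over per-modality logits, argmax is primary."""
    weights = softmax(np.asarray(gate_logits, dtype=float))
    return names[int(np.argmax(weights))], weights


def hierarchical_fusion(features, gate_logits):
    """Fuse three modalities in two stages around the gated primary one."""
    names = list(features)
    primary, weights = select_primary(gate_logits, names)
    aux_a, aux_b = [n for n in names if n != primary]

    # Stage 1: the auxiliary modalities are fused with each other first.
    aux_fused = features[aux_a] + crossmodal_attention(
        features[aux_a], features[aux_b]
    )
    # Stage 2: the primary modality attends to the fused auxiliary stream,
    # with a residual connection so its own information is preserved.
    fused = features[primary] + crossmodal_attention(
        features[primary], aux_fused
    )
    return primary, weights, fused


# Toy inputs: sequence lengths differ per modality, feature size is shared.
rng = np.random.default_rng(0)
features = {
    "text": rng.normal(size=(5, 8)),
    "audio": rng.normal(size=(7, 8)),
    "vision": rng.normal(size=(6, 8)),
}
gate_logits = [2.0, 0.5, 0.1]  # would be learned in practice; fixed here
primary, weights, fused = hierarchical_fusion(features, gate_logits)
print(primary, fused.shape)
```

In this sketch the fused output keeps the primary modality's sequence length, and the residual connection in the second stage is one simple way to keep the primary stream from being overwritten by auxiliary information. How the actual model parameterizes the gate and the attention layers is not specified in the abstract.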