Multi-modal emotion recognition in conversations is a challenging problem due to the complex and complementary interactions between different modalities. Audio and textual cues are particularly important for understanding emotions from a human perspective. Most existing studies focus on exploring interactions between audio and text modalities at the same representation level. However, a critical issue is often overlooked: the heterogeneous modality gap between low-level audio representations and high-level text representations. To address this problem, we propose a novel framework called Heterogeneous Bimodal Attention Fusion (HBAF) for multi-level multi-modal interaction in conversational emotion recognition. The proposed method comprises three key modules: the uni-modal representation module, the multi-modal fusion module, and the inter-modal contrastive learning module. The uni-modal representation module incorporates contextual content into low-level audio representations to bridge the heterogeneous multi-modal gap, enabling more effective fusion. The multi-modal fusion module uses dynamic bimodal attention and a dynamic gating mechanism to filter out incorrect cross-modal relationships and fully exploit both intra-modal and inter-modal interactions. Finally, the inter-modal contrastive learning module captures complex absolute and relative interactions between audio and text modalities. Experiments on the MELD and IEMOCAP datasets demonstrate that the proposed HBAF method outperforms existing state-of-the-art baselines.
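To make the fusion and contrastive components more concrete, the following is a minimal illustrative sketch in PyTorch, not the authors' released code: it assumes utterance-level audio and text sequences of equal length and dimension, implements gated bimodal cross-attention (a sigmoid gate modulating each cross-attended stream), and an InfoNCE-style inter-modal contrastive loss as one plausible instantiation of the inter-modal contrastive learning module. All class, function, and parameter names here are hypothetical.

    # Illustrative sketch only: gated bimodal cross-attention fusion plus an
    # InfoNCE-style inter-modal contrastive loss. Module names, dimensions, and
    # the exact gating/contrastive formulation are assumptions, not HBAF's code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GatedBimodalAttentionFusion(nn.Module):
        """Fuse aligned audio and text utterance sequences with cross-attention
        and a dynamic sigmoid gate."""

        def __init__(self, dim: int = 256, num_heads: int = 4):
            super().__init__()
            # Text attends to audio and vice versa (cross-modal attention).
            self.text_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.audio_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # Gate decides, per dimension, how much cross-modal information to pass.
            self.gate = nn.Linear(2 * dim, dim)
            self.proj = nn.Linear(2 * dim, dim)

        def forward(self, text: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
            # text, audio: (batch, seq_len, dim) utterance-level representations.
            t2a, _ = self.text_to_audio(query=text, key=audio, value=audio)
            a2t, _ = self.audio_to_text(query=audio, key=text, value=text)
            # Sigmoid gates filter unreliable cross-modal relationships.
            g_text = torch.sigmoid(self.gate(torch.cat([text, t2a], dim=-1)))
            g_audio = torch.sigmoid(self.gate(torch.cat([audio, a2t], dim=-1)))
            fused_text = text + g_text * t2a
            fused_audio = audio + g_audio * a2t
            return self.proj(torch.cat([fused_text, fused_audio], dim=-1))

    def inter_modal_contrastive_loss(text_emb: torch.Tensor,
                                     audio_emb: torch.Tensor,
                                     temperature: float = 0.07) -> torch.Tensor:
        """InfoNCE-style loss: matching audio/text pairs in a batch are pulled
        together, mismatched pairs are pushed apart."""
        text_emb = F.normalize(text_emb, dim=-1)    # (batch, dim)
        audio_emb = F.normalize(audio_emb, dim=-1)  # (batch, dim)
        logits = text_emb @ audio_emb.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric over both retrieval directions (text->audio and audio->text).
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    if __name__ == "__main__":
        batch, seq_len, dim = 2, 10, 256
        text = torch.randn(batch, seq_len, dim)
        audio = torch.randn(batch, seq_len, dim)
        fusion = GatedBimodalAttentionFusion(dim)
        fused = fusion(text, audio)                               # (2, 10, 256)
        loss = inter_modal_contrastive_loss(text.mean(1), audio.mean(1))
        print(fused.shape, loss.item())

The residual-plus-gate form (text + g * t2a) is one common way to let the model fall back on the uni-modal stream when cross-modal evidence is noisy; the actual HBAF gating and contrastive objectives may differ in detail.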