Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition

Enabling machines to understand human emotions from multimodal signals in dialogue is an active research topic, formalized as the task of multimodal emotion recognition in conversation (MM-ERC). MM-ERC has received consistent attention in recent years, and a diverse range of methods has been proposed to improve task performance. Most existing works treat MM-ERC as a standard multimodal classification problem and perform multimodal feature disentanglement and fusion to maximize feature utility. Yet after revisiting the characteristics of MM-ERC, we argue that both feature multimodality and conversational contextualization should be modeled simultaneously during the feature disentanglement and fusion steps. In this work, we aim to push task performance further by taking these insights fully into account. On the one hand, during feature disentanglement, we devise a Dual-level Disentanglement Mechanism (DDM), based on contrastive learning, to decouple the features into both the modality space and the utterance space. On the other hand, during feature fusion, we propose a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism (CRM) for multimodal and context integration, respectively. Together, they schedule the proper integration of multimodal and context features: CFM explicitly and dynamically manages the contributions of the multimodal features, while CRM flexibly coordinates the introduction of dialogue contexts. On two public MM-ERC datasets, our system consistently achieves new state-of-the-art performance. Further analyses demonstrate that all of the proposed mechanisms greatly facilitate the MM-ERC task by adaptively making full use of the multimodal and context features. Our proposed methods also have great potential to facilitate a broader range of other conversational multimodal tasks.
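To make the idea of contribution-aware fusion concrete, the following is a minimal illustrative sketch, not the CFM described in this work. The abstract does not specify how contributions are computed, so the sketch assumes each modality comes with a scalar contribution score (in practice it would likely be learned) and fuses the modality features as a softmax-weighted sum. The names `softmax`, `fuse_modalities`, and the modality keys `text`, `audio`, `visual` are hypothetical.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(
    features: Dict[str, List[float]],
    scores: Dict[str, float],
) -> List[float]:
    """Fuse per-modality feature vectors as a softmax-weighted sum.

    ASSUMPTION: `scores` holds one scalar contribution score per modality.
    This is a generic weighting scheme chosen for illustration; it is not
    taken from the paper.
    """
    names = sorted(features)  # fixed order so weights align with features
    weights = softmax([scores[name] for name in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for name, weight in zip(names, weights):
        vec = features[name]
        if len(vec) != dim:
            raise ValueError(f"modality {name!r} has dimension {len(vec)}, expected {dim}")
        for i, value in enumerate(vec):
            fused[i] += weight * value
    return fused


# Toy utterance features for three modalities (hypothetical values).
utterance = {
    "text": [1.0, 0.0],
    "audio": [0.0, 1.0],
    "visual": [1.0, 1.0],
}
equal = fuse_modalities(utterance, {"text": 0.0, "audio": 0.0, "visual": 0.0})
text_heavy = fuse_modalities(utterance, {"text": 100.0, "audio": 0.0, "visual": 0.0})
```

With equal scores the fused vector is the plain average of the three modality vectors; as one score grows, the fused vector approaches that modality's features. A context-level analogue could weight neighboring utterances in the same way, though that too would be an assumption rather than the CRM itself.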
