Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition

Enabling machines to understand human emotions from multimodal signals in dialogue, a task known as multimodal emotion recognition in conversation (MM-ERC), has become an active research topic. MM-ERC has received steady attention in recent years, and a diverse range of methods has been proposed to improve task performance. Most existing works treat MM-ERC as a standard multimodal classification problem and perform multimodal feature disentanglement and fusion to maximize feature utility. After revisiting the characteristics of MM-ERC, however, we argue that both feature multimodality and conversational contextualization should be modeled simultaneously during the feature disentanglement and fusion steps. In this work, we aim to further improve task performance by taking both insights into account. For feature disentanglement, we devise a Dual-level Disentanglement Mechanism (DDM), based on contrastive learning, to decouple the features into both the modality space and the utterance space. For feature fusion, we propose a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism (CRM) for multimodal and context integration, respectively. Together they schedule the proper integration of multimodal and context features: CFM explicitly and dynamically manages the contributions of the multimodal features, while CRM flexibly coordinates the introduction of dialogue contexts. On two public MM-ERC datasets, our system consistently achieves new state-of-the-art performance. Further analyses demonstrate that all of the proposed mechanisms greatly facilitate the MM-ERC task by making full, adaptive use of the multimodal and context features. Our proposed methods also have the potential to facilitate a broader range of other conversational multimodal tasks.
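To make the two ideas named in the abstract more concrete, the following is a minimal sketch and not the paper's actual formulation. It assumes a generic InfoNCE-style contrastive loss as a stand-in for the contrastive objective behind DDM, and a softmax-gated weighted sum as a stand-in for contribution-aware fusion in CFM. The function names (`info_nce`, `weighted_fusion`), the `gate_weights` parameters, and the temperature value are all hypothetical and chosen for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style loss for one anchor (assumed, not from the paper).

    The loss is small when the anchor is closer to the positive than to
    the negatives, which is the pulling/pushing behavior that contrastive
    disentanglement relies on.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    return -math.log(softmax(logits)[0])


def weighted_fusion(features, gate_weights):
    """Fuse modality vectors with softmax-normalized contribution weights.

    features:     dict mapping modality name -> feature vector
    gate_weights: dict mapping modality name -> scoring vector
                  (hypothetical learned parameters)
    Returns (contributions per modality, fused vector).
    """
    names = sorted(features)
    scores = [dot(features[n], gate_weights[n]) for n in names]
    contrib = softmax(scores)
    dim = len(features[names[0]])
    fused = [
        sum(c * features[n][i] for c, n in zip(contrib, names))
        for i in range(dim)
    ]
    return dict(zip(names, contrib)), fused


if __name__ == "__main__":
    feats = {"text": [1.0, 0.0], "audio": [0.0, 1.0], "visual": [1.0, 1.0]}
    gates = {"text": [2.0, 0.0], "audio": [0.0, 0.0], "visual": [0.0, 0.0]}
    contributions, fused = weighted_fusion(feats, gates)
    print(contributions)
    print(fused)
```

In this sketch the contribution weights always sum to one, so a modality that scores higher for a given utterance takes a larger share of the fused representation. A real system would learn the gating parameters jointly with the encoders and would apply the contrastive loss over batches of modality-level and utterance-level features.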
