Multimodal Conversational Emotion (MCE) detection, which typically spans the acoustic, vision and language modalities, has attracted increasing interest in the multimedia community. Previous studies predominantly focus on learning contextual information in conversations. Only a few consider topic information, and those do so in the language modality alone, neglecting acoustic and vision topic information. To address this gap, we propose a model-agnostic Topic-enriched Diffusion (TopicDiff) approach for capturing multimodal topic information in MCE tasks. Specifically, we integrate a diffusion model into a neural topic model to alleviate the diversity deficiency that neural topic models suffer from when capturing topic information. Detailed evaluations show that TopicDiff significantly improves over state-of-the-art MCE baselines, supporting both the importance of multimodal topic information to MCE and the effectiveness of TopicDiff in capturing it. We also find that topic information in the acoustic and vision modalities is more discriminative and robust than that in the language modality.