The objective of multimodal intent recognition (MIR) is to leverage various modalities, such as text, video, and audio, to detect user intentions, which is crucial for understanding human language and context in dialogue systems. Despite advances in this field, two main challenges persist: (1) effectively extracting and utilizing semantic information to obtain robust textual features; and (2) aligning and fusing non-verbal modalities with verbal ones. This paper proposes a Text Enhancement with CommOnsense Knowledge Extractor (TECO) to address these challenges. We begin by extracting relations from both generated and retrieved knowledge to enrich the contextual information in the text modality. Subsequently, we align and integrate visual and acoustic representations with these enhanced text features to form a cohesive multimodal representation. Our experimental results demonstrate substantial improvements over existing baseline methods.