Multimodal intent recognition aims to leverage diverse modalities such as facial expressions, body movements and tone of speech to understand a user's intent, and it is a critical task for understanding human language and behavior in real-world multimodal scenarios. Nevertheless, most existing methods ignore potential correlations among different modalities and are limited in their ability to learn semantic features from nonverbal modalities. In this paper, we introduce a token-level contrastive learning method with modality-aware prompting (TCL-MAP) to address these challenges. To establish an optimal multimodal semantic environment for the text modality, we develop a modality-aware prompting module (MAP), which aligns and fuses features from the text, video and audio modalities using similarity-based modality alignment and a cross-modality attention mechanism. Based on the modality-aware prompt and the ground-truth labels, the proposed token-level contrastive learning framework (TCL) constructs augmented samples and applies the NT-Xent loss to the label token. Specifically, TCL uses the textual semantic information derived from intent labels to guide the learning of the other modalities in return. Extensive experiments show that our method achieves substantial improvements over state-of-the-art methods. Additionally, ablation analyses demonstrate that the modality-aware prompt outperforms the handcrafted prompt, a result that is significant for multimodal prompt learning. The code is released at https://github.com/thuiar/TCL-MAP.
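To make the contrastive objective concrete, the following is a minimal sketch of the standard NT-Xent loss in plain Python. It is not taken from the released code: the function names `cosine` and `nt_xent`, the temperature value of 0.5, and the idea of passing in one embedding per sample (standing in for the label-token representation of an original sample and of its augmented counterpart) are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(view_a, view_b, temperature=0.5):
    """Mean NT-Xent loss over 2N embeddings.

    view_a[i] and view_b[i] form a positive pair (hypothetically, the
    label-token embeddings of a sample and of its augmented version).
    Every other embedding in the batch acts as a negative.
    """
    n = len(view_a)
    z = list(view_a) + list(view_b)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the positive partner of z[i]
        # Scaled similarities to every embedding except z[i] itself.
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - pos_logit
    return total / (2 * n)


if __name__ == "__main__":
    originals = [[1.0, 0.0], [0.0, 1.0]]
    augmented = [[1.0, 0.1], [0.1, 1.0]]
    print(nt_xent(originals, augmented))
```

With correctly paired views the loss is lower than when the pairs are shuffled, which is the signal that pulls the two label-token representations of the same sample together while pushing other samples apart.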