Multi-modal intent detection aims to use multiple modalities to understand the user's intentions, which is essential for deploying dialogue systems in real-world scenarios. The two core challenges for multi-modal intent detection are (1) how to effectively align and fuse features from different modalities and (2) the limited amount of labeled multi-modal intent training data. In this work, we introduce a shallow-to-deep interaction framework with data augmentation (SDIF-DA) to address these challenges. First, SDIF-DA leverages a shallow-to-deep interaction module to progressively align and fuse features across the text, video, and audio modalities. Second, we propose a ChatGPT-based data augmentation approach to automatically generate additional training data. Experimental results show that SDIF-DA achieves state-of-the-art performance, indicating that it aligns and fuses multi-modal features effectively. In addition, extensive analyses show that the proposed data augmentation approach can successfully distill knowledge from the large language model.