Multimodal intent understanding is a significant research area that requires effectively leveraging multiple modalities to analyze human language. Existing methods face two main challenges. First, they have limitations in capturing the nuanced, high-level semantics underlying complex in-distribution (ID) multimodal intents. Second, they generalize poorly to unseen out-of-distribution (OOD) data in real-world scenarios. To address these issues, we propose a novel method for both ID classification and OOD detection (MIntOOD). We first introduce a weighted feature fusion network that models multimodal representations effectively. This network dynamically learns the importance of each modality, adapting to multimodal contexts. To develop discriminative representations that benefit both tasks, we synthesize pseudo-OOD data from convex combinations of ID data and perform multimodal representation learning from both coarse-grained and fine-grained perspectives. The coarse-grained perspective focuses on distinguishing between the ID and OOD binary classes, while the fine-grained perspective enhances the understanding of ID data by incorporating binary confidence scores. These scores gauge the difficulty of each sample, improving the classification of different ID classes. Additionally, the fine-grained perspective captures instance-level interactions between ID and OOD samples, promoting proximity among similar instances and separation from dissimilar ones. We establish baselines on three multimodal intent datasets and build an OOD detection benchmark. Extensive experiments on these datasets demonstrate that our method significantly improves OOD detection performance, with a 3-10% increase in AUROC scores, while achieving new state-of-the-art results in ID classification. The full data and code are available at https://github.com/thuiar/MIntOOD.
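The pseudo-OOD synthesis step above can be sketched in a few lines. The abstract only states that pseudo-OOD samples are convex combinations of ID data; the function name, the cross-class pairing heuristic, and the mixing-coefficient range below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def synthesize_pseudo_ood(features, labels, num_samples, rng=None):
    """Synthesize pseudo-OOD samples as convex combinations of pairs of ID
    feature vectors drawn from *different* classes (a mixup-style heuristic).

    features: (n, d) array of ID feature vectors.
    labels:   length-n sequence of ID class labels.
    Returns a (num_samples, d) array of synthetic pseudo-OOD features.
    """
    rng = np.random.default_rng(rng)
    n = len(features)
    out = []
    for _ in range(num_samples):
        # Pair samples from different classes so the mixture is unlikely
        # to fall inside any single class manifold.
        i, j = rng.choice(n, size=2, replace=False)
        while labels[i] == labels[j]:
            i, j = rng.choice(n, size=2, replace=False)
        # Keep the mixing coefficient away from 0 and 1 so the result is
        # not trivially close to either ID endpoint (an assumed range).
        lam = rng.uniform(0.3, 0.7)
        out.append(lam * features[i] + (1.0 - lam) * features[j])
    return np.stack(out)
```

Because each output is a convex combination (coefficients lam and 1 - lam sum to one), mixing one-hot-like features from distinct classes yields points that lie between class clusters, which is exactly the property the coarse-grained ID-versus-OOD objective can exploit.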