Multimodal intent recognition poses significant challenges, requiring the incorporation of non-verbal modalities from real-world contexts to enhance the comprehension of human intentions. However, existing benchmark datasets are limited in scale and struggle to handle the out-of-scope samples that arise in multi-turn conversational interactions. We introduce MIntRec2.0, a large-scale benchmark dataset for multimodal intent recognition in multi-party conversations. It contains 1,245 dialogues with 15,040 samples, each annotated within a new intent taxonomy of 30 fine-grained classes. Beyond 9,304 in-scope samples, it also includes 5,736 out-of-scope samples that appear in multi-turn contexts, as naturally occurs in real-world scenarios. Furthermore, we provide comprehensive speaker information for each utterance, enriching its utility for multi-party conversational research. We establish a general framework that supports organizing single-turn and multi-turn dialogue data, modality feature extraction, multimodal fusion, in-scope classification, and out-of-scope detection. Evaluation benchmarks are built using classic multimodal fusion methods, ChatGPT, and human evaluators. While existing methods that incorporate non-verbal information yield improvements, effectively leveraging context information and detecting out-of-scope samples remain substantial challenges. Notably, large language models exhibit a significant performance gap relative to humans, highlighting the limitations of machine learning methods in this cognitive intent-understanding task. We believe MIntRec2.0 will serve as a valuable resource, providing a pioneering foundation for research on human-machine conversational interactions and significantly facilitating related applications. The full dataset and codes are available at https://github.com/thuiar/MIntRec2.0.
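To make the framework's stages concrete, the following is a minimal Python sketch of late multimodal fusion, in-scope classification over the 30-class taxonomy, and out-of-scope detection via maximum-softmax-probability thresholding. The feature dimensions, concatenation-based fusion, `FusionClassifier` and `predict_with_oos` names, and the 0.5 threshold are illustrative assumptions, not the benchmark's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_IN_SCOPE_CLASSES = 30  # the dataset's fine-grained intent taxonomy


class FusionClassifier(nn.Module):
    """Illustrative late-fusion classifier: concatenate per-modality
    utterance features, then classify over the in-scope intents."""

    def __init__(self, text_dim=768, video_dim=256, audio_dim=128, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(text_dim + video_dim + audio_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
        )
        self.head = nn.Linear(hidden, NUM_IN_SCOPE_CLASSES)

    def forward(self, text, video, audio):
        fused = self.fuse(torch.cat([text, video, audio], dim=-1))
        return self.head(fused)  # logits over the 30 in-scope intents


@torch.no_grad()
def predict_with_oos(model, text, video, audio, threshold=0.5):
    """Assign each utterance an in-scope intent, or -1 (out-of-scope)
    when the maximum softmax probability falls below the threshold."""
    probs = F.softmax(model(text, video, audio), dim=-1)
    conf, pred = probs.max(dim=-1)
    return torch.where(conf >= threshold, pred, torch.full_like(pred, -1))


# Toy usage: random tensors stand in for extracted modality features.
model = FusionClassifier().eval()
text = torch.randn(4, 768)   # e.g., pooled sentence embeddings
video = torch.randn(4, 256)  # e.g., pooled visual features
audio = torch.randn(4, 128)  # e.g., pooled acoustic features
print(predict_with_oos(model, text, video, audio))
```

Thresholding the maximum softmax probability is a standard open-set baseline rather than the paper's specific method; the framework is designed so that stronger fusion architectures and out-of-scope detectors can be swapped into the same classify-then-reject interface.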