As sharing images in instant messaging is a crucial factor, there has been active research on learning image-text multi-modal dialogue models. However, training a well-generalized multi-modal dialogue model remains challenging due to the low quality and limited diversity of images per dialogue in existing multi-modal dialogue datasets. In this paper, we propose an automated pipeline to construct a multi-modal dialogue dataset, ensuring both dialogue quality and image diversity while requiring minimal human effort. In our pipeline, to guarantee coherence between images and dialogue, we prompt GPT-4 to infer potential image-sharing moments: specifically, the utterance, speaker, rationale, and image description. Furthermore, we leverage CLIP similarity to maintain consistency among the multiple images aligned to each utterance. Through this pipeline, we introduce DialogCC, a high-quality and diverse multi-modal dialogue dataset that surpasses existing datasets in terms of quality and diversity in human evaluation. Our comprehensive experiments show that when multi-modal dialogue models are trained on our dataset, their generalization performance on unseen dialogue datasets is significantly enhanced. We make our source code and dataset publicly available.
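To make the CLIP-similarity step concrete, the following is a minimal sketch of similarity-based image filtering, not the pipeline's actual implementation. It assumes the embeddings of the image description and of the candidate images have already been computed by a CLIP encoder. The function names and the threshold value of 0.5 are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def filter_images_by_similarity(
    text_embedding: Sequence[float],
    image_embeddings: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> List[Tuple[int, float]]:
    """Keep candidate images whose similarity to the text meets the threshold.

    Returns (image_index, similarity) pairs, most similar first.
    """
    scored = [
        (index, cosine_similarity(text_embedding, embedding))
        for index, embedding in enumerate(image_embeddings)
    ]
    kept = [(index, score) for index, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for precomputed CLIP embeddings (assumption).
description = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
selected = filter_images_by_similarity(description, candidates)
```

With these toy vectors, the second candidate is orthogonal to the description and is dropped, while the first and third are kept. In practice the threshold would need to be tuned on real CLIP scores, which tend to occupy a narrower range than this example suggests.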