Dialogue systems now perform well on text-based communication, but they have not yet effectively incorporated visual information, which remains a significant challenge. Moreover, existing models that incorporate images into dialogue generation focus on discussing the image itself. Our approach takes a different perspective on multi-modal dialogue systems: it interprets the image in the context of the dialogue. In doing so, we aim to expand the capabilities of current dialogue systems and move them from a single modality (text) to multiple modalities. However, validated English datasets that contain both images and dialogue contexts for this task are lacking. We therefore propose a two-stage approach to automatically construct a multi-modal dialogue dataset. In the first stage, we use text-to-image similarity and sentence similarity to identify utterances that could be replaced with an image. In the second stage, we replace those utterances by selecting a subset of relevant images and filtering them with a visual question answering model. Using this approach, together with additional labeling, we created the IMage Augmented multi-modal Dialogue dataset (IMAD), which can serve as a validated dataset for this task. We also propose a baseline model trained on this dataset, which outperforms both a model trained on the same data without images and BlenderBot.
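The two-stage construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the thresholds, the `top_k` cutoff, the use of image captions as the target of sentence similarity, and the `vqa_accepts` callback standing in for a visual question answering model are all assumptions made for illustration. Embeddings are taken as precomputed vectors from some hypothetical text and image encoders.

```python
import math
from typing import Callable, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_replaceable(
    utt_text_vec: Vector,            # sentence embedding of the utterance (assumed)
    utt_joint_vec: Vector,           # utterance embedded in a shared text-image space (assumed)
    image_vecs: Sequence[Vector],    # candidate image embeddings (assumed)
    caption_vecs: Sequence[Vector],  # caption sentence embeddings (assumed)
    image_thr: float = 0.5,          # hypothetical threshold
    sent_thr: float = 0.5,           # hypothetical threshold
) -> bool:
    """Stage 1 (sketch): an utterance is a candidate for replacement when both
    its best text-to-image similarity and its best sentence similarity pass
    their thresholds."""
    best_image = max(cosine(utt_joint_vec, v) for v in image_vecs)
    best_caption = max(cosine(utt_text_vec, v) for v in caption_vecs)
    return best_image >= image_thr and best_caption >= sent_thr


def pick_image(
    utt_joint_vec: Vector,
    images: Sequence[Tuple[str, Vector]],  # (image_id, embedding) pairs
    vqa_accepts: Callable[[str], bool],    # stand-in for a VQA-based filter
    top_k: int = 2,                        # hypothetical subset size
) -> Optional[str]:
    """Stage 2 (sketch): keep the top_k most similar images, then return the
    first one that the VQA-style filter accepts, or None if none passes."""
    ranked = sorted(
        images, key=lambda item: cosine(utt_joint_vec, item[1]), reverse=True
    )
    for image_id, _ in ranked[:top_k]:
        if vqa_accepts(image_id):
            return image_id
    return None
```

In this sketch an utterance is replaced only if `is_replaceable` returns `True` and `pick_image` then returns an image identifier; utterances for which no image survives the filter stay as text.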