Image-grounded dialogue systems benefit greatly from integrating visual information, resulting in high-quality response generation. However, current models struggle to effectively utilize such information in zero-resource scenarios, mainly due to the disparity between image and text modalities. To overcome this challenge, we propose an innovative multimodal framework, called ZRIGF, which assimilates image-grounded information for dialogue generation in zero-resource situations. ZRIGF implements a two-stage learning strategy, comprising contrastive pre-training and generative pre-training. Contrastive pre-training includes a text-image matching module that maps images and texts into a unified encoded vector space, along with a text-assisted masked image modeling module that preserves pre-training visual features and fosters further multimodal feature alignment. Generative pre-training employs a multimodal fusion module and an information transfer module to produce insightful responses based on harmonized multimodal representations. Comprehensive experiments conducted on both text-based and image-grounded dialogue datasets demonstrate ZRIGF's efficacy in generating contextually pertinent and informative responses. Furthermore, we adopt a fully zero-resource scenario in the image-grounded dialogue dataset to demonstrate our framework's robust generalization capabilities in novel domains. The code is available at https://github.com/zhangbo-nlp/ZRIGF.
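The text-image matching idea described above, mapping images and texts into one vector space so that matched pairs score higher than mismatched ones, can be illustrated with a symmetric contrastive loss. The following is only a minimal sketch under stated assumptions and is not taken from the ZRIGF code: the function names, the cosine-similarity scoring, and the temperature value of 0.07 are all illustrative choices, and the embeddings are assumed to be precomputed, non-zero vectors. The linked repository is the place to look for the actual implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity_matrix(text_vecs, image_vecs):
    """Return sim[i][j] = cosine similarity of text i and image j."""
    texts = [l2_normalize(v) for v in text_vecs]
    images = [l2_normalize(v) for v in image_vecs]
    return [[sum(a * b for a, b in zip(t, im)) for im in images] for t in texts]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)


def contrastive_matching_loss(text_vecs, image_vecs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Pair k (text k, image k) is treated as the positive; every other
    combination in the batch is treated as a negative. The temperature
    default is an illustrative assumption, not a value from the paper.
    """
    sim = cosine_similarity_matrix(text_vecs, image_vecs)
    text_to_image = [[s / temperature for s in row] for row in sim]
    # Transpose to score each image against all texts.
    image_to_text = [list(col) for col in zip(*text_to_image)]
    return 0.5 * (
        diagonal_cross_entropy(text_to_image)
        + diagonal_cross_entropy(image_to_text)
    )


# Toy embeddings: the aligned batch pairs each text with a nearby image,
# the swapped batch pairs each text with the wrong image.
texts = [[1.0, 0.0], [0.0, 1.0]]
images_aligned = [[0.9, 0.1], [0.1, 0.9]]
images_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = contrastive_matching_loss(texts, images_aligned)
loss_swapped = contrastive_matching_loss(texts, images_swapped)
print(loss_aligned < loss_swapped)
```

In this toy batch the aligned pairing yields the smaller loss, which is the behavior a matching objective of this kind rewards: minimizing it pulls paired text and image embeddings together and pushes unpaired ones apart.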