Image-grounded dialogue systems benefit greatly from integrating visual information, resulting in high-quality response generation. However, current models struggle to use such information effectively in zero-resource scenarios, mainly because of the disparity between the image and text modalities. To overcome this challenge, we propose a multimodal framework, called ZRIGF, which assimilates image-grounded information for dialogue generation in zero-resource situations. ZRIGF implements a two-stage learning strategy comprising contrastive pre-training and generative pre-training. Contrastive pre-training includes a text-image matching module that maps images and texts into a unified encoded vector space, along with a text-assisted masked image modeling module that preserves pre-trained visual features and fosters further multimodal feature alignment. Generative pre-training employs a multimodal fusion module and an information transfer module to produce insightful responses based on harmonized multimodal representations. Comprehensive experiments conducted on both text-based and image-grounded dialogue datasets demonstrate ZRIGF's efficacy in generating contextually pertinent and informative responses. Furthermore, we evaluate under a fully zero-resource scenario on the image-grounded dialogue dataset to demonstrate our framework's robust generalization capabilities in novel domains. The code is available at https://github.com/zhangbo-nlp/ZRIGF.