Integrating multimodal knowledge into large language models (LLMs) is a significant advance for dialogue generation. However, incorporating such knowledge effectively in zero-resource scenarios remains challenging because diverse, high-quality dialogue datasets are scarce. To address this, we propose the Visual Implicit Knowledge Distillation Framework (VIKDF), which enhances LLMs for enriched dialogue generation in zero-resource contexts by leveraging implicit multimodal knowledge. VIKDF comprises two stages: knowledge distillation, in which an Implicit Query Transformer extracts visual implicit knowledge from image-text pairs and encodes it into knowledge vectors; and knowledge integration, in which a novel Bidirectional Variational Information Fusion technique integrates these distilled vectors into LLMs. This enables the LLMs to generate dialogues that are coherent and engaging and that reflect a deep understanding of the context through implicit multimodal cues, overcoming the limitations of zero-resource scenarios. Extensive experiments on two dialogue datasets show that VIKDF outperforms existing state-of-the-art models in generating high-quality dialogues. The code will be made publicly available upon acceptance.
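To make the distillation stage more concrete, the following is a minimal sketch of one common way a query-based transformer can turn image features into a fixed number of vectors: a small set of query vectors cross-attends over patch features. This is an illustrative assumption, not the VIKDF implementation. The abstract does not specify the Implicit Query Transformer's internals, so the function `cross_attend`, the sizes `DIM`, `NUM_QUERIES`, and `NUM_PATCHES`, the random stand-in inputs, and the omission of learned projections are all hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, features):
    """Scaled dot-product cross-attention: each query attends over all features.

    Assumption: keys and values are the raw features; a real model would
    apply learned projections first. Returns (vectors, attention_weights).
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    vectors, weights = [], []
    for q in queries:
        attn = softmax([dot(q, f) / scale for f in features])
        # Weighted average of the features, one output vector per query.
        pooled = [sum(w * f[d] for w, f in zip(attn, features)) for d in range(dim)]
        vectors.append(pooled)
        weights.append(attn)
    return vectors, weights

rng = random.Random(0)
DIM, NUM_QUERIES, NUM_PATCHES = 8, 4, 16  # hypothetical sizes

# Hypothetical stand-ins for learnable queries and image-encoder patch features.
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
patch_features = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]

knowledge_vectors, attn_weights = cross_attend(queries, patch_features)
print(len(knowledge_vectors), len(knowledge_vectors[0]))
```

Whatever the image size, the output here is always `NUM_QUERIES` vectors of width `DIM`, which is the property that makes such vectors convenient to hand to a language model. How VIKDF trains its queries and fuses the resulting vectors into the LLM is not covered by this sketch.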