Modern text-to-vision generative models often hallucinate when the prompt describing the scene to be generated is underspecified. In large language models (LLMs), a prevalent strategy for reducing hallucinations is to retrieve factual knowledge from an external database. While such retrieval augmentation strategies have great potential to enhance text-to-vision generators, existing static top-K retrieval methods explore the knowledge pool only once, missing the broader context necessary for high-quality generation. Furthermore, LLMs internally possess rich world knowledge learned during large-scale training (parametric knowledge) that could mitigate the need for external data retrieval. This paper proposes Contextual Knowledge Pursuit (CKPT), a framework that leverages the complementary strengths of external and parametric knowledge to help generators produce reliable visual content. Instead of a one-time retrieval of facts from an external database to improve a given prompt, CKPT uses (1) an LLM to decide whether to seek external knowledge or to self-elicit descriptions from its parametric knowledge, (2) a knowledge pursuit process to contextually seek and sequentially gather the most relevant facts, (3) a knowledge aggregator that enhances the prompt with the gathered fact context, and (4) a filtered fine-tuning objective to improve visual synthesis with richer prompts. We evaluate CKPT across multiple text-driven generative tasks (image, 3D rendering, and video) on datasets of rare objects and daily scenarios. Our results show that CKPT is capable of generating faithful and semantically rich content across diverse visual domains, offering a promising data source for zero-shot synthesis and filtered fine-tuning of text-to-vision generative models.
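The four-step pipeline can be sketched as a simple loop. This is a minimal, hypothetical illustration of the idea, not the authors' implementation: the callables `decide`, `retrieve_external`, `elicit_parametric`, and `aggregate` are illustrative stand-ins for CKPT's components, and the stub data in the usage example is invented.

```python
# Hypothetical sketch of the CKPT loop: each iteration conditions the next
# knowledge query on the facts gathered so far, unlike static top-K retrieval,
# which queries the knowledge pool only once.

def contextual_knowledge_pursuit(prompt, decide, retrieve_external,
                                 elicit_parametric, aggregate, k=3):
    """Sequentially gather k facts relevant to `prompt`, then aggregate
    them into an enhanced prompt for the text-to-vision generator."""
    facts = []
    for _ in range(k):
        # (1) Choose between external retrieval and parametric elicitation.
        if decide(prompt, facts):
            # (2a) External: fetch the most relevant fact given the context.
            fact = retrieve_external(prompt, facts)
        else:
            # (2b) Parametric: self-elicit a description from the LLM.
            fact = elicit_parametric(prompt, facts)
        facts.append(fact)
    # (3) Aggregate the gathered fact context into an enriched prompt.
    return aggregate(prompt, facts)

# Toy usage with stub components (invented example data).
external_pool = ["axolotls are aquatic salamanders",
                 "they retain larval gills into adulthood",
                 "they are native to lakes near Mexico City"]
enhanced = contextual_knowledge_pursuit(
    prompt="a photo of an axolotl",
    decide=lambda p, facts: True,                     # toy policy: always external
    retrieve_external=lambda p, facts: next(
        f for f in external_pool if f not in facts),  # skip already-gathered facts
    elicit_parametric=lambda p, facts: "(self-elicited description)",
    aggregate=lambda p, facts: p + " | " + "; ".join(facts),
)
print(enhanced)
```

Step (4), filtered fine-tuning, operates downstream: enhanced prompts and their generations are filtered for quality before being used as training data, and is omitted from this sketch.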