Hallucinations and unfaithful synthesis caused by inaccurate prompts that lack sufficient semantic detail are widely observed in multimodal generative models. A prevalent strategy for aligning multiple modalities is to fine-tune the generator on a large number of annotated text-image pairs. However, this procedure is labor-intensive and resource-intensive. The key question we ask is: can we enhance the quality and faithfulness of text-driven generative models without relying on extensive text-image pair annotations? To address this question, we propose Knowledge Pursuit Prompting (KPP), a zero-shot framework that iteratively incorporates external knowledge to help generators produce reliable visual content. Instead of training generators to handle generic prompts, KPP employs a recursive knowledge query process to gather informative external facts from a knowledge base, instructs a language model to compress the acquired knowledge for prompt refinement, and uses text-driven generators for visual synthesis. The entire process is zero-shot and requires no access to the architectures or parameters of the generative models. We evaluate the framework on multiple text-driven generative tasks (image, 3D rendering, and video) using datasets from different domains. We further demonstrate the extensibility and adaptability of KPP by varying the underlying foundation models and instructions. Our results show that KPP is capable of generating faithful and semantically rich content across diverse visual domains, offering a promising solution for improving multimodal generative models.
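The three stages described above (recursive knowledge query, knowledge compression, text-driven generation) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the method's actual implementation: the in-memory `KNOWLEDGE_BASE`, the word-overlap `retrieve` function, and the string-joining `compress` function are hypothetical stand-ins for a real knowledge base, a real retriever, and a language model, and `generator` is any black-box callable that maps a prompt to an output.

```python
from typing import Callable, List, Optional, Set

# Hypothetical toy knowledge base; a real system would query an external source.
KNOWLEDGE_BASE: List[str] = [
    "A kiwi is a flightless bird native to New Zealand.",
    "Kiwi birds have long, thin beaks and brown, hair-like feathers.",
    "New Zealand is an island country in the Pacific Ocean.",
    "The Eiffel Tower is a wrought-iron tower in Paris.",
]


def tokenize(text: str) -> Set[str]:
    """Lowercase words longer than three characters, punctuation stripped."""
    words = (w.strip(".,").lower() for w in text.split())
    return {w for w in words if len(w) > 3}


def retrieve(context: str, used: List[str]) -> Optional[str]:
    """Assumed retriever: the unused fact sharing the most words with the context."""
    context_words = tokenize(context)
    best, best_score = None, 0
    for fact in KNOWLEDGE_BASE:
        if fact in used:
            continue
        score = len(context_words & tokenize(fact))
        if score > best_score:
            best, best_score = fact, score
    return best  # None when no remaining fact overlaps with the context


def pursue_knowledge(prompt: str, max_rounds: int = 3) -> List[str]:
    """Iteratively query the knowledge base, conditioning on facts found so far."""
    facts: List[str] = []
    for _ in range(max_rounds):
        context = " ".join([prompt] + facts)
        fact = retrieve(context, facts)
        if fact is None:
            break
        facts.append(fact)
    return facts


def compress(prompt: str, facts: List[str]) -> str:
    """Stand-in for the language-model compression step: plain concatenation."""
    if not facts:
        return prompt
    return prompt + ". Context: " + " ".join(facts)


def kpp(prompt: str, generator: Callable[[str], str], max_rounds: int = 3) -> str:
    """Refine the prompt with retrieved knowledge, then call a black-box generator."""
    refined = compress(prompt, pursue_knowledge(prompt, max_rounds))
    return generator(refined)


if __name__ == "__main__":
    # The generator here just echoes the refined prompt it would receive.
    print(kpp("a kiwi bird", generator=lambda p: f"[image generated from: {p}]"))
```

In this toy run, the fact about New Zealand shares no words with the original prompt and becomes retrievable only after the first kiwi fact has been added to the context, which is the point of querying iteratively. The generator is only ever called with a string, consistent with the requirement that no model architecture or parameters be accessed.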