Text-to-image generation has made remarkable progress with the emergence of diffusion models. However, these models often generate factually inconsistent images that fail to reflect the factual information and common sense conveyed by the input text prompt. We refer to this issue as image hallucination. Drawing on studies of hallucination in language models, we classify this problem into three types and propose a methodology that uses factual images retrieved from external sources to generate realistic images. Depending on the nature of the hallucination, we employ an off-the-shelf image editing tool, either InstructPix2Pix or IP-Adapter, to leverage factual information from the retrieved image. This approach enables the generation of images that accurately reflect facts and common sense.