Text-to-image (T2I) generation methods are widely used to generate art and other creative artifacts. While visual hallucinations can be a positive factor in scenarios where creativity is appreciated, they are poorly suited to cases where the generated image must be grounded in complex natural language that lacks explicit visual elements. In this paper, we propose to strengthen the consistency of T2I methods on complex natural language, which often exceeds the capabilities of T2I methods because it includes non-visual information and textual elements that require knowledge to render accurately. To address these phenomena, we propose a Natural Language to Verified Image generation approach (NL2VI) that converts a natural prompt into a visual prompt, which is more suitable for image generation. A T2I model then generates an image for the visual prompt, and the image is verified with visual question answering (VQA) algorithms. Experimentally, aligning natural prompts with image generation improves the consistency of the generated images by up to 11% over the state of the art. Moreover, the improvements generalize to challenging domains such as cooking and DIY tasks, where the correctness of the generated image is crucial for illustrating actions.
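The three stages described in the abstract (rewrite the natural prompt into a visual prompt, generate an image, verify it with VQA) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function and type names are hypothetical, the verification questions are supplied by the caller, and the idea of sampling several candidates and keeping the best-scoring one is one plausible way to use the verifier, not necessarily what the paper does. The toy stand-ins at the bottom replace the real language model, T2I model, and VQA model so that the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical interfaces; the abstract does not specify an API.
Rewriter = Callable[[str], str]            # natural prompt -> visual prompt
Generator = Callable[[str, int], object]   # (visual prompt, seed) -> image
Verifier = Callable[[object, str], bool]   # (image, yes/no question) -> answer


@dataclass
class VerifiedImage:
    image: object
    visual_prompt: str
    score: float  # fraction of verification questions answered "yes"


def nl2vi_sketch(
    natural_prompt: str,
    questions: List[str],
    rewrite: Rewriter,
    generate: Generator,
    verify: Verifier,
    num_candidates: int = 4,
    threshold: float = 1.0,
) -> Optional[VerifiedImage]:
    """Rewrite the prompt, generate candidates, and keep the best-verified one."""
    if not questions:
        raise ValueError("at least one verification question is required")

    # Stage 1: turn the natural prompt into a prompt with explicit visual content.
    visual_prompt = rewrite(natural_prompt)

    best: Optional[VerifiedImage] = None
    for seed in range(num_candidates):
        # Stage 2: generate one candidate image for the visual prompt.
        image = generate(visual_prompt, seed)

        # Stage 3: score the candidate by the share of questions it passes.
        passed = sum(1 for q in questions if verify(image, q))
        score = passed / len(questions)

        if best is None or score > best.score:
            best = VerifiedImage(image, visual_prompt, score)
        if score >= threshold:
            break  # assumption: stop as soon as a candidate is fully verified
    return best


# Toy stand-ins so the sketch runs without any model (all hypothetical).
def toy_rewrite(prompt: str) -> str:
    return "a hand whisking eggs in a glass bowl"


def toy_generate(visual_prompt: str, seed: int) -> dict:
    return {"prompt": visual_prompt, "seed": seed}


def toy_verify(image: dict, question: str) -> bool:
    return image["seed"] >= 2  # pretend only later samples pass verification


result = nl2vi_sketch(
    "Beat the eggs until fluffy.",
    ["Is there a bowl?", "Are eggs visible?"],
    toy_rewrite,
    toy_generate,
    toy_verify,
)
```

In this toy run the first two candidates fail every question, so the loop continues until the third candidate passes and is returned. Passing the three stages in as callables keeps the sketch independent of any particular language model, T2I model, or VQA model.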