Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models

The revolution in artificial intelligence content generation has been rapidly accelerated by booming text-to-image (T2I) diffusion models. Within just two years of development, state-of-the-art models have come to generate content of unprecedented quality, diversity, and creativity. However, a prevalent limitation persists: it remains difficult to communicate effectively with popular T2I models, such as Stable Diffusion, using natural language descriptions. As a result, an engaging image is typically hard to obtain without expertise in prompt engineering, which involves complex word compositions, magic tags, and annotations. Inspired by the recently released DALLE3, a T2I model built directly into ChatGPT that talks human language, we revisit existing T2I systems that endeavor to align with human intent and introduce a new task, interactive text to image (iT2I), in which people interact with an LLM in natural language for interleaved high-quality image generation, editing, and refinement, together with question answering, with stronger correspondence between images and text. To address the iT2I problem, we present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models. We evaluate our approach in a variety of commonly used scenarios with different LLMs, e.g., ChatGPT, LLAMA, Baichuan, and InternLM. We demonstrate that our approach can be a convenient and low-cost way to introduce iT2I ability to any existing LLM and any text-to-image model without any training, while causing little degradation of the LLM's inherent capabilities in, e.g., question answering and code generation. We hope this work can draw broader attention and inspire efforts to improve the user experience of human-machine interaction, alongside the image quality, of next-generation T2I systems.
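
To make the idea of augmenting an LLM with prompting and an off-the-shelf T2I model more concrete, the following is a minimal sketch, not the implementation evaluated in this work. It assumes the LLM is prompted to wrap image descriptions in `<image>...</image>` tags, and that a separate router parses the reply and forwards each description to a T2I backend. The tag format, `SYSTEM_PROMPT`, `split_response`, `render`, and `fake_t2i` are all hypothetical names introduced for illustration; in practice `fake_t2i` would be replaced by a call into a real model such as Stable Diffusion.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical tag format: the abstract does not specify how the LLM
# marks image requests, so <image>...</image> is an assumption.
IMAGE_TAG = re.compile(r"<image>(.*?)</image>", re.DOTALL)

# Hypothetical system prompt that teaches the LLM the tag convention.
SYSTEM_PROMPT = (
    "You can show images. When an image would help, write a detailed "
    "description of it between <image> and </image> tags; otherwise "
    "answer in plain text."
)

Segment = Tuple[str, str]  # (kind, payload); kind is "text" or "image"


def split_response(response: str) -> List[Segment]:
    """Split an LLM reply into ordered text and image-description segments."""
    segments: List[Segment] = []
    cursor = 0
    for match in IMAGE_TAG.finditer(response):
        text = response[cursor:match.start()].strip()
        if text:
            segments.append(("text", text))
        description = match.group(1).strip()
        if description:
            segments.append(("image", description))
        cursor = match.end()
    tail = response[cursor:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


def render(response: str, text_to_image: Callable[[str], str]) -> List[Segment]:
    """Replace each image description with the output of the T2I backend."""
    return [
        (kind, text_to_image(payload) if kind == "image" else payload)
        for kind, payload in split_response(response)
    ]


def fake_t2i(prompt: str) -> str:
    """Stand-in for a real T2I model; returns a placeholder string."""
    return f"[image: {prompt}]"


if __name__ == "__main__":
    reply = (
        "Here is a first draft. "
        "<image>a corgi wearing a red scarf, watercolor</image> "
        "Would you like any changes?"
    )
    for kind, payload in render(reply, fake_t2i):
        print(kind, "|", payload)
```

Because the router only inspects the reply text, a reply with no tags passes through unchanged as plain text, which is one plausible reading of why such a scheme would leave ordinary question answering and code generation largely unaffected.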
