Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders, remain text-pixel mappers -- they employ LLMs merely as text encoders, without leveraging their inherent reasoning capabilities to infer what should be visually depicted given the textual prompt. To move beyond such literal generation, we propose the think-then-generate (T2G) paradigm, in which the LLM-based text encoder is encouraged to reason about and rewrite raw user prompts; the states of the rewritten prompts then serve as the diffusion conditioning. To achieve this, we first activate the think-then-rewrite pattern of the LLM encoder with a lightweight supervised fine-tuning stage. Subsequently, the LLM encoder and diffusion backbone are co-optimized via Dual-GRPO to ensure faithful reasoning about the context and accurate rendering of the semantics. In particular, the text encoder is reinforced with image-grounded rewards to infer and recall world knowledge, while the diffusion backbone is pushed to produce semantically consistent and visually coherent images. Experiments show substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based image generation and editing benchmarks, achieving a WISE score of 0.79, nearly on par with GPT-4. Our results constitute a promising step toward next-generation unified models with reasoning, expression, and demonstration capacities.
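The Dual-GRPO co-optimization described above builds on GRPO-style group-relative advantage estimation: for each prompt, a group of rollouts (prompt rewrites for the encoder, images for the diffusion backbone) is scored by image-grounded rewards, and each rollout's advantage is its reward normalized by the group's mean and standard deviation. The sketch below illustrates only this advantage computation; the reward values, group sizes, and the `dual_grpo_step` helper are illustrative assumptions, not the paper's implementation.

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: standardize each sampled
    rollout's reward against the group's mean and standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # epsilon guards against zero-variance groups
    return [(r - mean) / std for r in rewards]

def dual_grpo_step(encoder_rewards, diffusion_rewards):
    """One conceptual Dual-GRPO step: the LLM encoder's rewrites and the
    diffusion backbone's images each receive their own group-relative
    advantages, computed from image-grounded reward scores (hypothetical
    values here)."""
    return grpo_advantages(encoder_rewards), grpo_advantages(diffusion_rewards)

# Toy example: four rewrite rollouts and four image rollouts per prompt.
enc_adv, diff_adv = dual_grpo_step([0.2, 0.8, 0.5, 0.9], [0.6, 0.4, 0.7, 0.3])
print([round(a, 2) for a in enc_adv])
```

By construction, each group's advantages are zero-mean, so above-average rollouts are reinforced and below-average ones are suppressed, without requiring a learned value baseline.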