UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation

Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for images. To advance this direction, we propose a unified reinforcement learning framework tailored to interleaved generation. We validate the approach on its fundamental unit: a single round of reasoning-driven image generation, in which the model first expands the user prompt through reasoning and then synthesizes the image. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO, which jointly optimizes the text and image generation policies using GRPO. To avoid over-design, we adopt a minimalist methodology that reuses established training recipes for both modalities, combining standard GRPO for reasoning with FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we make two modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to keep rollouts linear and unbranched, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty applied directly to the velocity fields, which provides a more robust and direct regularization signal for mitigating reward hacking. Our experiments demonstrate that this unified training recipe significantly improves image generation quality through reasoning, providing a robust and scalable baseline for future post-training of fully interleaved models.
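
To make two ingredients of the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It shows (a) GRPO-style group-relative advantages computed from sparse terminal rewards and (b) a mean-squared-error penalty between policy and reference velocity predictions. The function names, the use of population standard deviation, the `eps` stabilizer, and all numeric values are assumptions introduced here for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize terminal rewards within one prompt's rollout group.

    Assumption: population standard deviation plus a small eps; actual
    implementations may normalize differently.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def velocity_mse_penalty(v_policy, v_ref):
    """Mean squared error between two velocity predictions (flat lists)."""
    if len(v_policy) != len(v_ref):
        raise ValueError("velocity vectors must have the same length")
    return sum((p - q) ** 2 for p, q in zip(v_policy, v_ref)) / len(v_policy)


# Hypothetical terminal rewards for four rollouts of the same prompt.
# Under a sparse terminal reward, each rollout's single advantage would be
# shared by its reasoning tokens and its denoising steps.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

# Hypothetical velocity predictions from the policy and a frozen reference.
penalty = velocity_mse_penalty([0.1, -0.3, 0.4], [0.0, -0.2, 0.4])
```

In this sketch, rollouts with above-average reward receive positive advantages and those below average receive negative ones, while the penalty grows as the policy's velocity field drifts from the reference. How the penalty is weighted against the policy-gradient term is not specified in the abstract and is left out here.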
