InterleaveThinker: Reinforcing Agentic Interleaved Generation

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot perform interleaved generation (producing a sequence of interleaved text and images), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence and instruct the image generator on what to execute at each step. We then introduce a critic agent that evaluates the generator's outputs, identifies samples that deviate from the planned instructions, and refines the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k datasets to perform a format cold start. We then develop the Interleave-Critic-RL-13k dataset to reinforce step-wise instruction correction within a generation trajectory using GRPO. Because a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. We therefore propose an accuracy reward and a step-wise reward, which allow single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
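The planner, generator, and critic loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Step` type, the three callable interfaces, the `max_retries` budget, and the stub agents in the demo are all hypothetical names and behaviors introduced here for illustration. The sketch only shows the control flow, in which the planner emits a step sequence, each image step is sent to the generator, and the critic either accepts the output or returns a refined instruction for regeneration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    kind: str     # "text" or "image" (hypothetical step types)
    content: str  # text to emit, or the instruction for the image generator


# Hypothetical interfaces; the abstract does not specify any of these signatures.
Planner = Callable[[str], List[Step]]            # prompt -> planned steps
Generator = Callable[[str], str]                 # instruction -> image handle
Critic = Callable[[str, str], Tuple[bool, str]]  # (instruction, image) -> (ok, refined instruction)


def run_interleaved(prompt: str, planner: Planner, generator: Generator,
                    critic: Critic, max_retries: int = 2):
    """Run one interleaved trajectory; return (outputs, number of generator calls)."""
    outputs = []
    calls = 0
    for step in planner(prompt):
        if step.kind == "text":
            outputs.append(("text", step.content))
            continue
        instruction = step.content
        image = generator(instruction)
        calls += 1
        # Ask the critic up to max_retries times; regenerate after each rejection.
        # The image produced by the last regeneration is kept without a further check.
        for _ in range(max_retries):
            ok, refined = critic(instruction, image)
            if ok:
                break
            instruction = refined
            image = generator(instruction)
            calls += 1
        outputs.append(("image", image))
    return outputs, calls


# Demo with stub agents (assumptions, standing in for real models).
def demo_planner(prompt: str) -> List[Step]:
    return [Step("text", "Intro"), Step("image", "draw cat"), Step("image", "draw dog")]


def demo_generator(instruction: str) -> str:
    return f"img<{instruction}>"


def demo_critic(instruction: str, image: str) -> Tuple[bool, str]:
    # Accept only instructions that were already refined once.
    if "(detailed)" in instruction:
        return True, instruction
    return False, instruction + " (detailed)"


outputs, calls = run_interleaved("a story", demo_planner, demo_generator, demo_critic)
```

In this toy run each image step is rejected once and regenerated, so two image steps cost four generator calls. This illustrates how retries multiply the number of calls per trajectory, which is consistent with the abstract's motivation for single-step rather than full-trajectory optimization.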
