Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot perform interleaved generation (producing sequences of text and images), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence and instruct the image generator on what to execute at each step. We then introduce a critic agent that evaluates the generator's outputs, identifies samples that deviate from the planned instructions, and refines the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k datasets to perform a format cold-start. We then develop Interleave-Critic-RL-13k to reinforce the critic's step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. We therefore propose an accuracy reward and a step-wise reward, which allow single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, with 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.
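The planner-generator-critic loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `Step` record, the callable signatures, and the `max_retries` cap are all hypothetical, and the toy planner, generator, and critic are string-based stand-ins for the actual models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    planned: str      # instruction produced by the planner
    instruction: str  # instruction actually used for the kept image
    image: str        # stand-in for the generated image
    attempts: int     # number of generator calls spent on this step


def run_interleaved(
    prompt: str,
    planner: Callable[[str], List[str]],
    generator: Callable[[str], str],
    critic: Callable[[str, str], Tuple[bool, str]],
    max_retries: int = 2,  # assumption: the abstract does not state a retry cap
) -> List[Step]:
    """Plan step instructions, generate each image, and regenerate on critic rejection."""
    steps: List[Step] = []
    for planned in planner(prompt):
        instruction = planned
        attempts = 0
        while True:
            image = generator(instruction)
            attempts += 1
            # The critic judges the output against the planned instruction
            # and proposes a refined instruction when it rejects the output.
            accepted, refined = critic(planned, image)
            if accepted or attempts > max_retries:
                break
            instruction = refined
        steps.append(Step(planned, instruction, image, attempts))
    return steps


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def toy_planner(prompt: str) -> List[str]:
    return [part.strip() for part in prompt.split(";") if part.strip()]


def toy_generator(instruction: str) -> str:
    return f"<image:{instruction}>"


def toy_critic(planned: str, image: str) -> Tuple[bool, str]:
    # Rejects any output whose instruction was not marked as detailed.
    if "detailed" in image:
        return True, planned
    return False, f"{planned} (detailed)"


if __name__ == "__main__":
    for step in run_interleaved("draw a cat; draw a dog", toy_planner, toy_generator, toy_critic):
        print(step)
```

Because each step may trigger several generator calls, a long trajectory quickly accumulates calls, which is consistent with the motivation for optimizing the critic one step at a time rather than over the whole trajectory.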