Despite impressive progress in high-fidelity image synthesis, generative models still struggle with logic-intensive instruction following, exposing a persistent reasoning--execution gap. Meanwhile, closed-source systems (e.g., Nano Banana) have demonstrated strong reasoning-driven image generation, highlighting a substantial gap between such systems and current open-source models. We argue that closing this gap requires not merely better visual generators, but executable reasoning: decomposing high-level intents into grounded, verifiable plans that directly steer the generative process. To this end, we propose Unified Thinker, a task-agnostic reasoning architecture for general image generation, designed as a unified planning core that can plug into diverse generators and workflows. Unified Thinker decouples a dedicated Thinker from the image Generator, enabling modular upgrades of the reasoning component without retraining the entire generative model. We further introduce a two-stage training paradigm: we first build a structured planning interface for the Thinker, then apply reinforcement learning to ground its policy in pixel-level feedback, encouraging plans that favor visual correctness over mere textual plausibility. Extensive experiments on text-to-image generation and image editing show that Unified Thinker substantially improves image reasoning and generation quality.
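To make the decoupled design concrete, the sketch below illustrates one plausible shape for the structured planning interface and the pixel-level reward. It is a minimal illustration under assumptions, not the paper's implementation: all names (`PlanStep`, `Plan`, `GeneratorBackend`, `pixel_reward`, the `verifier` object) are hypothetical, and the verifier score is a stand-in for whatever pixel-level feedback the reinforcement-learning stage actually uses.

```python
"""Minimal sketch of a decoupled Thinker/Generator interface.

Hypothetical illustration of the abstract's design, not the paper's API:
the Thinker emits a structured, verifiable plan; any generator backend
consumes it; an RL reward is grounded in pixel-level verification.
"""
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class PlanStep:
    """One grounded, checkable step of the Thinker's plan."""
    operation: str   # e.g. "place", "recolor", "remove"
    target: str      # object or region the step refers to
    constraint: str  # verifiable condition, e.g. "left of the car"


@dataclass
class Plan:
    """Structured planning interface between Thinker and Generator."""
    prompt: str
    steps: list[PlanStep] = field(default_factory=list)


class GeneratorBackend(Protocol):
    """Any image generator that can consume a structured plan; the
    Thinker is agnostic to which backend renders the pixels."""
    def generate(self, plan: Plan) -> Any: ...


def pixel_reward(image: Any, plan: Plan, verifier: Any) -> float:
    """RL reward grounded in pixel-level feedback: the fraction of plan
    constraints a visual verifier judges satisfied in the image."""
    if not plan.steps:
        return 0.0
    satisfied = sum(bool(verifier.check(image, s.constraint))
                    for s in plan.steps)
    return satisfied / len(plan.steps)
```

Because the structured plan, rather than raw text, is what crosses the boundary, a Thinker trained against this interface can in principle be upgraded or swapped while any backend satisfying the `GeneratorBackend` protocol stays frozen, which is the modular-upgrade property the abstract claims.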