We introduce GenAgent, which unifies visual understanding and generation through an agentic multimodal model. Unlike unified models, which incur high training costs and trade off understanding against generation, GenAgent decouples the two capabilities within an agentic framework: the multimodal model itself handles understanding, while generation is delegated to image generation models treated as invokable tools. Crucially, unlike existing modular systems constrained by static pipelines, this design enables autonomous multi-turn interaction in which the agent produces multimodal chains of thought spanning reasoning, tool invocation, judgment, and reflection to iteratively refine its outputs. We employ a two-stage training strategy: first, a cold start with supervised fine-tuning on high-quality tool-invocation and reflection data to bootstrap agent behaviors; second, end-to-end agentic reinforcement learning that combines pointwise rewards (final image quality) with pairwise rewards (reflection accuracy) and uses trajectory resampling to enhance multi-turn exploration. GenAgent significantly improves the performance of the base generator (FLUX.1-dev) on GenEval++ (+23.6\%) and WISE (+14\%). Beyond these gains, the framework exhibits three key properties: 1) cross-tool generalization to generators with varying capabilities, 2) test-time scaling, with consistent improvements across interaction rounds, and 3) task-adaptive reasoning that automatically adjusts to different tasks. Our code will be available at \href{https://github.com/deep-kaixun/GenAgent}{this URL}.
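The multi-turn loop of tool invocation, judgment, and reflection described in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the GenAgent implementation: `run_agent`, `Turn`, `toy_generate`, `toy_judge`, and `max_rounds` are hypothetical names introduced here. In the actual system the multimodal model itself would perform the reasoning, judgment, and reflection, and the tool would be an image generator such as FLUX.1-dev.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Turn:
    """One round of the loop: the prompt sent to the tool and the verdict on its output."""
    prompt: str
    image: str
    accepted: bool
    reflection: str


def run_agent(
    task: str,
    generate: Callable[[str], str],
    judge: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> List[Turn]:
    """Hypothetical generate -> judge -> reflect loop (names are assumptions)."""
    history: List[Turn] = []
    prompt = task
    for _ in range(max_rounds):
        image = generate(prompt)                   # tool invocation
        accepted, reflection = judge(task, image)  # judgment plus reflection
        history.append(Turn(prompt, image, accepted, reflection))
        if accepted:
            break
        # Fold the reflection back into the next prompt to refine the output.
        prompt = f"{task}. Fix: {reflection}"
    return history


# Toy stand-ins so the sketch runs without any model; both are assumptions.
def toy_generate(prompt: str) -> str:
    return f"<image for: {prompt}>"


def toy_judge(task: str, image: str) -> Tuple[bool, str]:
    # Accept only images produced from a prompt that was already refined once.
    if "Fix:" in image:
        return True, "looks correct"
    return False, "object count is wrong"


history = run_agent("three red apples on a table", toy_generate, toy_judge)
for turn in history:
    print(turn.accepted, "|", turn.prompt)
```

With these toy stand-ins the first round is rejected and the second, refined round is accepted. Raising `max_rounds` corresponds loosely to the test-time scaling across interaction rounds that the abstract mentions.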