As instruction-driven visual content creation expands rapidly, diffusion models have demonstrated remarkable generative potential. Yet their deployment is constrained by a dual bottleneck: semantic ambiguity in diverse prompts and the narrow specialization of individual models. A single diffusion architecture struggles to maintain optimal performance across heterogeneous prompts, while conventional "parse-then-call" pipelines artificially separate semantic understanding from generative execution. To bridge this gap, we introduce DiffusionAgent, a unified, language-model-driven agent that casts the entire loop of prompt comprehension, expert routing, and image synthesis into a single agentic framework. Our contributions are threefold: (1) a tree-of-thought-powered expert navigator that performs fine-grained semantic parsing and zero-shot matching of each prompt to the most suitable diffusion model via an extensible prior-knowledge tree; (2) an advantage database updated with human-in-the-loop feedback, which continually aligns the model-selection policy with human aesthetic and semantic preferences; and (3) a fully decoupled agent architecture that activates the optimal generative path for open-domain prompts without retraining or fine-tuning any expert. Extensive experiments show that DiffusionAgent retains high generation quality while significantly broadening prompt coverage, establishing a new performance and generality benchmark for multi-domain image synthesis. The code is available at https://github.com/DiffusionAgent/DiffusionAgent
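To make the routing idea concrete, the following is a minimal sketch of how a prior-knowledge tree and an advantage database might interact. It is not the DiffusionAgent implementation. The class names, the model names, and the keyword-overlap matching are all assumptions introduced for illustration; in particular, keyword overlap stands in for the language-model tree-of-thought reasoning described above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a hypothetical prior-knowledge tree."""
    name: str
    keywords: tuple = ()
    experts: list = field(default_factory=list)   # candidate model names (made up)
    children: list = field(default_factory=list)


class AdvantageDB:
    """Hypothetical store of human ratings per (category, model) pair."""

    def __init__(self):
        self.scores = {}

    def record(self, category, model, rating):
        self.scores.setdefault((category, model), []).append(rating)

    def mean(self, category, model):
        ratings = self.scores.get((category, model), [])
        return sum(ratings) / len(ratings) if ratings else 0.0


def route(node, prompt):
    """Descend the tree toward the child whose keywords best match the prompt.

    Assumption: keyword overlap replaces the language-model reasoning step.
    """
    words = set(prompt.lower().split())
    while node.children:
        best = max(node.children, key=lambda c: len(words & set(c.keywords)))
        if not words & set(best.keywords):
            break  # no child matches, so stay at the current node
        node = best
    return node


def select_expert(root, db, prompt):
    """Pick the best-rated expert at the routed node.

    max() returns the first item on ties, so without feedback the
    first listed expert is chosen.
    """
    node = route(root, prompt)
    model = max(node.experts, key=lambda m: db.mean(node.name, m))
    return node.name, model


# Illustrative tree; every category and model name here is hypothetical.
tree = Node("general", experts=["general-model"], children=[
    Node("portrait", keywords=("portrait", "face", "person"),
         experts=["portrait-model-a", "portrait-model-b"]),
    Node("landscape", keywords=("landscape", "mountain", "forest"),
         experts=["landscape-model"]),
])
db = AdvantageDB()
```

Under these assumptions, calling `select_expert(tree, db, "a portrait of an old sailor")` routes to the `portrait` node, and a later `db.record("portrait", "portrait-model-b", 5)` shifts the choice toward `portrait-model-b` without touching any expert model, which mirrors the decoupled, feedback-driven selection the abstract describes.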