Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

Shuang Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng, Wenbo Hu, Yi-Fan Zhang, Yunlong Lin, Wenxuan Huang, Mingyang Song, Dasen Dai, Bolin Jiang, Manyuan Zhang, Shi-Xue Zhang, Zhengkai Jiang, Lucas Wang, Zhao Zhong, Yu Cheng, Nanyun Peng

Source: arXiv. Project page: https://github.com/shawn0728/Unify-Agent

Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that Unify-Agent substantially improves over its base unified model across diverse benchmarks and real-world generation tasks, while approaching the world-knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.
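The four stages named in the abstract (prompt understanding, multimodal evidence searching, grounded recaptioning, final synthesis) can be pictured as a simple control flow. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the names `Evidence`, `Trajectory`, `run_pipeline`, and the `toy_*` stand-ins are all hypothetical, and each stage is a plain callable rather than the unified model itself.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    """One retrieved item (hypothetical type): a text snippet or an image reference."""
    kind: str      # "text" or "image"
    content: str


@dataclass
class Trajectory:
    """Hypothetical record of what each stage produced for one prompt."""
    prompt: str
    queries: List[str] = field(default_factory=list)
    evidence: List[Evidence] = field(default_factory=list)
    grounded_caption: str = ""
    image: str = ""


def run_pipeline(
    prompt: str,
    understand: Callable[[str], List[str]],
    search: Callable[[str], List[Evidence]],
    recaption: Callable[[str, List[Evidence]], str],
    synthesize: Callable[[str, List[Evidence]], str],
) -> Trajectory:
    """Run the four stages in order and keep every intermediate result."""
    traj = Trajectory(prompt=prompt)
    traj.queries = understand(prompt)                  # 1. prompt understanding
    for query in traj.queries:                         # 2. multimodal evidence searching
        traj.evidence.extend(search(query))
    traj.grounded_caption = recaption(prompt, traj.evidence)        # 3. grounded recaptioning
    traj.image = synthesize(traj.grounded_caption, traj.evidence)   # 4. final synthesis
    return traj


# Toy stand-ins (assumptions): a real system would call a model and a search tool.
def toy_understand(prompt: str) -> List[str]:
    return [prompt]


def toy_search(query: str) -> List[Evidence]:
    return [Evidence("text", f"facts about {query}"),
            Evidence("image", f"{query}.jpg")]


def toy_recaption(prompt: str, evidence: List[Evidence]) -> str:
    facts = "; ".join(e.content for e in evidence if e.kind == "text")
    return f"{prompt} ({facts})"


def toy_synthesize(caption: str, evidence: List[Evidence]) -> str:
    refs = sum(1 for e in evidence if e.kind == "image")
    return f"<image for '{caption}' using {refs} reference image(s)>"


if __name__ == "__main__":
    result = run_pipeline("a rare festival mask", toy_understand,
                          toy_search, toy_recaption, toy_synthesize)
    print(result.grounded_caption)
    print(result.image)
```

Keeping every intermediate output in one record loosely mirrors the idea of supervising the full agentic process from trajectories, though the actual trajectory format used for training is not described in the abstract.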
