Despite the success of existing image generation and editing methods, current models still struggle with complex problems such as intricate text prompts, and the absence of verification and self-correction mechanisms makes generated images unreliable. Meanwhile, a single model tends to specialize in particular tasks with the corresponding capabilities, making it inadequate for fulfilling all user requirements. We propose GenArtist, a unified image generation and editing system coordinated by a multimodal large language model (MLLM) agent. We integrate a comprehensive range of existing models into a tool library and use the agent for tool selection and execution. For a complex problem, the MLLM agent decomposes it into simpler sub-problems and constructs a tree structure to systematically plan generation, editing, and self-correction with step-by-step verification. By automatically generating missing position-related inputs and incorporating position information, the appropriate tool can be effectively employed to address each sub-problem. Experiments demonstrate that GenArtist performs a wide variety of generation and editing tasks, achieving state-of-the-art performance and surpassing existing models such as SDXL and DALL-E 3, as shown in Fig. 1. The project page is https://zhenyuw16.github.io/GenArtist_page.
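The decompose-plan-verify-correct loop described above can be sketched conceptually as follows. This is a minimal illustrative sketch, not the paper's actual implementation: all names (`Node`, `decompose`, `run_tool`, the tool identifiers, and the toy "image" dictionary) are hypothetical stand-ins for the MLLM agent's reasoning and the real generation, editing, and VQA-style verification tools in the library.

```python
# Hypothetical sketch of an agent-style plan/verify/self-correct loop.
# Real GenArtist uses an MLLM to decompose the prompt and dispatch to
# generation/editing/verification models; here trivial stubs stand in.
from dataclasses import dataclass, field

@dataclass
class Node:
    task: str                       # sub-problem, e.g. "verify a cat"
    tool: str                       # tool selected from the library
    children: list = field(default_factory=list)

def run_tool(tool: str, task: str, image):
    # Stub tool library; each branch mimics one tool category.
    if tool == "t2i_generation":
        return {"objects": set()}   # toy "image": just a set of object names
    if tool == "vqa_check":
        return task.removeprefix("verify ") in image["objects"]
    if tool == "edit_correction":
        image["objects"].add(task.removeprefix("verify "))
        return image
    raise ValueError(f"unknown tool: {tool}")

def decompose(prompt: str) -> Node:
    # Stand-in for the MLLM agent's tree planning: one generation root,
    # one verification child per requested object.
    root = Node(task=prompt, tool="t2i_generation")
    for obj in prompt.split(" and "):
        root.children.append(Node(task=f"verify {obj}", tool="vqa_check"))
    return root

def execute(root: Node):
    # Run the root tool, then step-by-step verification: any failed check
    # triggers an editing tool as self-correction.
    image = run_tool(root.tool, root.task, None)
    for child in root.children:
        if not run_tool(child.tool, child.task, image):
            image = run_tool("edit_correction", child.task, image)
    return image
```

For instance, `execute(decompose("a cat and a dog"))` first "generates" an empty image, then each verification step fails and invokes the correction tool, yielding an image containing both objects; this mirrors how a missing object in a generated image would be detected and fixed by an editing step.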