MuLan: Multimodal-LLM Agent for Progressive and Interactive Multi-Object Diffusion

Existing text-to-image models still struggle to generate images of multiple objects, especially in handling their spatial positions, relative sizes, overlaps, and attribute bindings. To address these challenges efficiently, we develop a training-free Multimodal-LLM agent (MuLan) that, like a human painter, progressively generates multiple objects with intricate planning and feedback control. MuLan uses a large language model (LLM) to decompose a prompt into a sequence of sub-tasks, each generating only one object with Stable Diffusion, conditioned on the previously generated objects. Unlike existing LLM-grounded methods, MuLan produces only a high-level plan at the beginning, while the exact size and location of each object are determined within each sub-task by an LLM and attention guidance. Moreover, MuLan adopts a vision-language model (VLM) to provide feedback on the image generated in each sub-task and to make the diffusion model re-generate the image if it violates the original prompt. Hence, each model in every step of MuLan only needs to address an easy sub-task it is specialized for. The multi-step process also allows human users to monitor the generation and make preferred changes at any intermediate step via text prompts, thereby improving the human-AI collaboration experience. We collect 200 prompts containing multiple objects with spatial relationships and attribute bindings from different benchmarks to evaluate MuLan. The results demonstrate the superiority of MuLan over baselines in generating multiple objects, and its creativity when collaborating with human users. The code is available at https://github.com/measure-infinity/mulan-code.
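The control flow described in the abstract (plan once, generate one object per sub-task, check each result, and re-generate on failure) can be illustrated with a minimal sketch. This is not the authors' implementation: `Canvas`, `progressive_generate`, the `plan`/`draw`/`check` callables, and the `max_retries` parameter are hypothetical names introduced here, and the toy stand-ins only manipulate strings in place of an LLM planner, a diffusion model, and a VLM checker.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Canvas:
    """Stand-in for the image state: the list of objects drawn so far."""
    objects: List[str] = field(default_factory=list)


def progressive_generate(
    prompt: str,
    plan: Callable[[str], List[str]],            # hypothetical LLM planner
    draw: Callable[[Canvas, str, int], Canvas],  # hypothetical one-object generator
    check: Callable[[Canvas, str], bool],        # hypothetical VLM feedback
    max_retries: int = 3,                        # assumed retry budget
) -> Canvas:
    """Generate one object per sub-task, retrying when the check fails."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    canvas = Canvas()
    for sub_task in plan(prompt):
        for attempt in range(max_retries):
            # Each attempt is conditioned on the objects accepted so far.
            candidate = draw(canvas, sub_task, attempt)
            if check(candidate, prompt):
                break
        # Keep the last candidate even if every retry failed.
        canvas = candidate
    return canvas


# Toy stand-ins so the sketch runs without any model.
def toy_plan(prompt: str) -> List[str]:
    return [part.strip() for part in prompt.split(" and ")]


def toy_draw(canvas: Canvas, sub_task: str, attempt: int) -> Canvas:
    # The first attempt is deliberately bad, to exercise the retry path.
    label = sub_task if attempt > 0 else "blurry " + sub_task
    return Canvas(canvas.objects + [label])


def toy_check(canvas: Canvas, prompt: str) -> bool:
    return not canvas.objects[-1].startswith("blurry")


result = progressive_generate(
    "a red cube and a blue ball", toy_plan, toy_draw, toy_check
)
print(result.objects)
```

Passing the three stages as callables keeps the loop independent of any particular model. A human-in-the-loop variant could, for example, wrap `check` so that a user's text feedback decides whether a sub-task is accepted.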

