MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion

Existing text-to-image models still struggle to generate images of multiple objects, especially in handling their spatial positions, relative sizes, overlapping, and attribute bindings. In this paper, we develop a training-free Multimodal-LLM agent (MuLan) that addresses these challenges through progressive multi-object generation with planning and feedback control, like a human painter. MuLan uses a large language model (LLM) to decompose a prompt into a sequence of sub-tasks, each generating only one object with Stable Diffusion, conditioned on the previously generated objects. Unlike existing LLM-grounded methods, MuLan produces only a high-level plan at the beginning, while the exact size and location of each object are determined by an LLM and attention guidance at each sub-task. Moreover, MuLan adopts a vision-language model (VLM) to provide feedback on the image generated in each sub-task and to control the diffusion model to re-generate the image if it violates the original prompt. Hence, each model in every step of MuLan only needs to address an easy sub-task it is specialized for. To evaluate MuLan, we collect 200 prompts containing multiple objects with spatial relationships and attribute bindings from different benchmarks. The results demonstrate the superiority of MuLan over baselines in generating multiple objects. The code is available at https://github.com/measure-infinity/mulan-code.
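
The plan, generate, and check loop described in the abstract can be sketched roughly as follows. This is only an illustrative skeleton and not the authors' implementation: the function `progressive_generate`, its parameters `plan`, `generate`, and `check`, and the `max_retries` limit are all hypothetical names introduced here. The three callables stand in for the LLM planner, the diffusion step conditioned on the current image, and the VLM feedback check. Checking each image against its sub-task description, and keeping the last attempt when no attempt passes, are simplifying assumptions.

```python
from typing import Callable, List, Optional, TypeVar

Image = TypeVar("Image")  # placeholder for whatever image type the generator returns


def progressive_generate(
    prompt: str,
    plan: Callable[[str], List[str]],                 # hypothetical: LLM planner
    generate: Callable[[str, Optional[Image]], Image],  # hypothetical: one diffusion step
    check: Callable[[Image, str], bool],              # hypothetical: VLM feedback
    max_retries: int = 3,                             # assumed retry budget
) -> Optional[Image]:
    """Generate one object per sub-task, re-generating when the check fails."""
    canvas: Optional[Image] = None
    for sub_task in plan(prompt):
        candidate: Optional[Image] = None
        # Always try at least once, even if max_retries is set below 1.
        for _ in range(max(1, max_retries)):
            # Condition the new object on everything generated so far.
            candidate = generate(sub_task, canvas)
            if check(candidate, sub_task):
                break
        # Assumption: keep the last attempt even if no attempt passed the check.
        canvas = candidate
    return canvas
```

Passing the three stages in as callables keeps the control flow separate from any particular model, so the loop can be exercised with simple stubs before real LLM, diffusion, and VLM calls are plugged in.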
