Existing text-to-image models still struggle to generate images containing multiple objects, especially in handling their spatial positions, relative sizes, overlaps, and attribute bindings. To address these challenges efficiently, we develop MuLan, a training-free Multimodal-LLM agent that, like a human painter, progressively generates multiple objects with intricate planning and feedback control. MuLan harnesses a large language model (LLM) to decompose a prompt into a sequence of sub-tasks, each generating only one object with Stable Diffusion, conditioned on the previously generated objects. Unlike existing LLM-grounded methods, MuLan produces only a high-level plan at the beginning, while the exact size and location of each object are determined within each sub-task by an LLM and attention guidance. Moreover, MuLan adopts a vision-language model (VLM) to provide feedback on the image generated in each sub-task and to direct the diffusion model to re-generate the image if it violates the original prompt. Hence, each model in every step of MuLan only needs to address an easy sub-task that it is specialized for. The multi-step process also allows human users to monitor generation and make preferred changes at any intermediate step via text prompts, thereby improving the human-AI collaboration experience. To evaluate MuLan, we collect 200 prompts from different benchmarks, each containing multiple objects with spatial relationships and attribute bindings. The results demonstrate the superiority of MuLan over baselines in generating multiple objects, as well as its creativity when collaborating with human users. The code is available at https://github.com/measure-infinity/mulan-code.
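The plan, generate, and check loop described above can be illustrated with a minimal sketch. This is not the released MuLan code: `Canvas`, `progressive_generate`, `max_attempts`, and the three callables (`plan`, `generate`, `accept`) are hypothetical names that stand in for the LLM planner, the diffusion step with attention guidance, and the VLM feedback check, respectively. Keeping the last attempt when every retry is rejected is also an assumption, since the abstract does not say what happens in that case.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Canvas:
    """Stand-in for the image so far: the ordered objects placed on it."""
    objects: Tuple[str, ...] = ()


def progressive_generate(
    prompt: str,
    plan: Callable[[str], List[str]],                # hypothetical LLM planner
    generate: Callable[[Canvas, str, int], Canvas],  # hypothetical diffusion step
    accept: Callable[[Canvas, str, str], bool],      # hypothetical VLM check
    max_attempts: int = 3,
) -> Canvas:
    """Add one object per sub-task, re-generating when the check rejects it."""
    canvas = Canvas()
    for obj in plan(prompt):
        candidate = canvas
        for attempt in range(max_attempts):
            # Each attempt is conditioned on the objects generated so far.
            candidate = generate(canvas, obj, attempt)
            if accept(candidate, prompt, obj):
                break
        # Assumption: if every attempt is rejected, keep the last one.
        canvas = candidate
    return canvas


# Toy stand-ins so the sketch runs without any model.
attempt_log: List[Tuple[str, int]] = []


def toy_plan(prompt: str) -> List[str]:
    # Split the prompt into one sub-task per object.
    return [part.strip() for part in prompt.split(" and ")]


def toy_generate(canvas: Canvas, obj: str, attempt: int) -> Canvas:
    attempt_log.append((obj, attempt))
    return Canvas(canvas.objects + (obj,))


def toy_accept(candidate: Canvas, prompt: str, obj: str) -> bool:
    # Reject the first attempt at the ball to exercise the retry path.
    last_obj, last_attempt = attempt_log[-1]
    return not (last_obj == "a blue ball" and last_attempt == 0)


result = progressive_generate(
    "a red cube and a blue ball", toy_plan, toy_generate, toy_accept
)
print(result.objects)
print(attempt_log)
```

Passing the three stages as callables keeps the control flow separate from any particular model, which mirrors the claim that each model only handles the sub-task it is specialized for. A human-in-the-loop variant could wrap `accept` so that a user can veto or amend an intermediate canvas.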