Existing text-to-image models still struggle to generate images of multiple objects, especially in handling their spatial positions, relative sizes, overlap, and attribute bindings. In this paper, we develop a training-free Multimodal-LLM agent (MuLan) that addresses these challenges through progressive multi-object generation with planning and feedback control, much as a human painter would. MuLan uses a large language model (LLM) to decompose a prompt into a sequence of sub-tasks, each of which generates only one object with Stable Diffusion, conditioned on the previously generated objects. Unlike existing LLM-grounded methods, MuLan produces only a high-level plan at the beginning, while the exact size and location of each object are determined by an LLM and attention guidance at each sub-task. Moreover, MuLan adopts a vision-language model (VLM) to provide feedback on the image generated in each sub-task and to control the diffusion model to re-generate the image if it violates the original prompt. Hence, each model in every step of MuLan only needs to address an easy sub-task that it is specialized for. To evaluate MuLan, we collect 200 prompts from different benchmarks, each containing multiple objects with spatial relationships and attribute bindings. The results show that MuLan outperforms the baselines at generating multiple objects. The code is available at https://github.com/measure-infinity/mulan-code.
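The plan, generate, and check loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `Canvas`, `progressive_generate`, the three callables, and the `max_retries` parameter are hypothetical names introduced here, and the toy functions stand in for the LLM planner, the diffusion step, and the VLM check. The sketch also omits the per-object size and location estimation and the attention guidance, and it assumes that the last candidate is kept when every retry is rejected.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Canvas:
    """Stand-in for the image: records the objects drawn so far."""
    objects: List[str] = field(default_factory=list)


def progressive_generate(
    prompt: str,
    plan: Callable[[str], List[str]],                 # hypothetical LLM planner
    generate: Callable[[Canvas, str, int], Canvas],   # hypothetical diffusion step
    accept: Callable[[Canvas, str], bool],            # hypothetical VLM check
    max_retries: int = 3,                             # assumed retry budget
) -> Canvas:
    """Add one object per sub-task, re-generating when the check fails."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    canvas = Canvas()
    for obj in plan(prompt):
        for attempt in range(max_retries):
            # Each sub-task is conditioned on the objects generated so far.
            candidate = generate(canvas, obj, attempt)
            if accept(candidate, prompt):
                break
        # Assumption: keep the last candidate even if all retries failed.
        canvas = candidate
    return canvas


# Toy stand-ins so the sketch runs without any model.
calls = []


def toy_plan(prompt: str) -> List[str]:
    return [part.strip() for part in prompt.split(" and ")]


def toy_generate(canvas: Canvas, obj: str, attempt: int) -> Canvas:
    calls.append((obj, attempt))
    # Simulate a bad first attempt for every object.
    label = obj if attempt > 0 else "blurry " + obj
    return Canvas(objects=canvas.objects + [label])


def toy_accept(candidate: Canvas, prompt: str) -> bool:
    return not candidate.objects[-1].startswith("blurry")


result = progressive_generate(
    "a red ball and a blue cube", toy_plan, toy_generate, toy_accept
)
print(result.objects)
```

With these toy functions, each object is rejected once and accepted on the second attempt, so the loop makes four generation calls for a two-object prompt.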