Text-to-image (T2I) generation has become a key area of research with broad applications. However, existing methods often struggle with complex spatial relationships and fine-grained control over multiple concepts, and many require significant architectural modifications, extensive training, or expert-level prompt engineering. To address these challenges, we introduce \textbf{LayerCraft}, an automated framework that leverages large language models (LLMs) as autonomous agents for structured, procedural generation. LayerCraft enables users to customize objects within an image and supports narrative-driven creation with minimal effort. At its core, a coordinator agent directs the process alongside two specialized agents: \textbf{ChainArchitect}, which employs chain-of-thought (CoT) reasoning to generate a dependency-aware 3D layout for precise instance-level control, and the \textbf{Object-Integration Network (OIN)}, which applies LoRA fine-tuning to pre-trained T2I models to seamlessly blend objects into specified regions of an image, guided by textual prompts and without requiring architectural changes. Extensive evaluations demonstrate LayerCraft's versatility in applications ranging from multi-concept customization to storytelling. By giving non-experts intuitive, precise control over T2I generation, our framework democratizes creative image generation. Our code will be released upon acceptance at github.com/PeterYYZhang/LayerCraft.