Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for layer-wise image generation that requires neither fine-tuning nor additional data. TAUE embeds global structural information from intermediate denoising latents into the initial noise to preserve spatial coherence, and integrates semantic cues through cross-layer attention sharing to maintain contextual and visual consistency across layers. Extensive experiments demonstrate that TAUE achieves state-of-the-art performance among training-free methods, delivering image quality comparable to fine-tuned models while improving inter-layer consistency. Moreover, it enables new applications, such as layout-aware editing, multi-object composition, and background replacement, indicating potential for interactive, layer-separated generation systems in real-world creative workflows.
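The two mechanisms named in the abstract — embedding structure from intermediate denoising latents into the initial noise, and sharing attention across layer generations — can be illustrated with a minimal NumPy sketch. This is a conceptual illustration only, not the paper's actual formulation: the blending rule, the `alpha` mixing weight, and both function names are assumptions introduced here for clarity.

```python
import numpy as np


def transplant_noise(initial_noise, intermediate_latent, alpha=0.3):
    """Blend coarse scene structure from an intermediate latent into
    fresh per-layer noise (a sketch of 'noise transplantation'; the
    exact rule used by TAUE is not specified in the abstract).

    The intermediate latent from a full-scene denoising pass carries
    the global layout, so mixing it into each layer's initial noise
    keeps the layers spatially aligned with one another."""
    # Renormalize so the result keeps roughly unit variance, since
    # diffusion samplers expect starting noise at a known scale.
    return np.sqrt(1.0 - alpha**2) * initial_noise + alpha * intermediate_latent


def share_attention(layer_queries, scene_keys, scene_values):
    """Sketch of cross-layer attention sharing: a layer's queries
    attend to key/value features cached from the full-scene pass,
    pulling in semantic context so per-layer content stays
    consistent with the composed scene."""
    d = scene_keys.shape[-1]
    scores = layer_queries @ scene_keys.T / np.sqrt(d)
    # Numerically stable softmax over the shared scene tokens.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ scene_values
```

In a real pipeline both functions would operate on the diffusion model's latent tensors inside the denoising loop; here they only show the data flow that the abstract describes, with structure injected once at initialization and semantics shared at every attention step.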