Layout-aware text-to-image generation is the task of generating multi-object images that reflect layout conditions in addition to text conditions. Current layout-aware text-to-image diffusion models still have several issues, including mismatches between the text and layout conditions and degraded quality of the generated images. This paper proposes a novel layout-aware text-to-image diffusion model called NoiseCollage to tackle these issues. During the denoising process, NoiseCollage independently estimates the noise for each object and then crops and merges these estimates into a single noise. This operation helps avoid condition mismatches; in other words, it puts the right objects in the right places. Qualitative and quantitative evaluations show that NoiseCollage outperforms several state-of-the-art models. These results indicate that the crop-and-merge operation on noise is a reasonable strategy for controlling image generation. We also show that NoiseCollage can be integrated with ControlNet to use edges, sketches, and pose skeletons as additional conditions. Experimental results show that this integration improves the layout accuracy of ControlNet. The code is available at https://github.com/univ-esuty/noisecollage.
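The crop-and-merge step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `box_mask` and `crop_and_merge`, the rectangular box format, the averaging of overlapping regions, and the use of a separate background noise for pixels that no box covers are all assumptions made for illustration. In a real diffusion model the per-object noises would come from the denoising network, one estimate per object condition.

```python
import numpy as np


def box_mask(height, width, box):
    """Binary mask for a box (x0, y0, x1, y1); end coordinates are exclusive."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float32)
    mask[y0:y1, x0:x1] = 1.0
    return mask


def crop_and_merge(object_noises, boxes, background_noise):
    """Merge per-object noise estimates into a single noise array.

    object_noises:    list of arrays shaped (C, H, W), one per object
    boxes:            list of (x0, y0, x1, y1), one per object
    background_noise: array shaped (C, H, W), kept wherever no box
                      covers a pixel (an assumption of this sketch)
    """
    _, height, width = background_noise.shape
    weighted_sum = np.zeros_like(background_noise, dtype=np.float32)
    coverage = np.zeros((height, width), dtype=np.float32)

    for noise, box in zip(object_noises, boxes):
        mask = box_mask(height, width, box)
        weighted_sum += noise * mask  # "crop": keep noise only inside the box
        coverage += mask              # count how many boxes cover each pixel

    # "merge": average where boxes overlap (assumed rule), background elsewhere
    covered = coverage > 0
    merged = background_noise.astype(np.float32).copy()
    merged[:, covered] = weighted_sum[:, covered] / coverage[covered]
    return merged
```

Because each object's noise is confined to its own region before merging, one object's condition cannot leak into another object's area in this sketch, which is the intuition behind placing the right objects in the right places.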