Diffusion models have recently achieved great success in image synthesis. However, in layout-to-image generation, where an image often depicts a complex scene containing multiple objects, achieving strong control over both the global layout map and each individual object remains challenging. In this paper, we propose LayoutDiffusion, a diffusion model that attains higher generation quality and greater controllability than previous works. To overcome the difficulty of multimodal fusion between image and layout, we construct structural image patches that carry region information and transform the patched image into a special layout, which is then fused with the normal layout in a unified form. Moreover, we propose a Layout Fusion Module (LFM) and Object-aware Cross Attention (OaCA) to model the relationships among multiple objects; both are designed to be object-aware and position-sensitive, allowing precise control of spatially related information. Extensive experiments show that LayoutDiffusion outperforms the previous SOTA methods in FID and CAS by relative margins of 46.35% and 26.70% on COCO-stuff, and 44.29% and 41.82% on VG. Code is available at https://github.com/ZGCTroy/LayoutDiffusion.
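The idea of turning a patched image into a "special layout" can be illustrated with a minimal sketch. It assumes that each patch of a regular grid is described by its normalized bounding box, so that patches and layout objects share one (label, box) representation. The names `patch_boxes`, `unify`, and the `<patch>` label are hypothetical and chosen for illustration only; the abstract does not specify how LayoutDiffusion embeds or fuses these entries.

```python
from typing import List, Tuple

# (x0, y0, x1, y1), normalized to [0, 1]
Box = Tuple[float, float, float, float]


def patch_boxes(grid_h: int, grid_w: int) -> List[Box]:
    """Return one normalized bounding box per patch, in row-major order."""
    boxes = []
    for row in range(grid_h):
        for col in range(grid_w):
            boxes.append((col / grid_w, row / grid_h,
                          (col + 1) / grid_w, (row + 1) / grid_h))
    return boxes


def unify(layout: List[Tuple[str, Box]], grid_h: int, grid_w: int) -> List[Tuple[str, Box]]:
    """Concatenate layout objects and patch pseudo-objects into one sequence.

    Assumption: patches are labeled with a placeholder "<patch>" token so that
    they have the same (label, box) form as real layout objects.
    """
    patches = [("<patch>", box) for box in patch_boxes(grid_h, grid_w)]
    return list(layout) + patches


# Toy layout with two objects and a 2 x 2 patch grid.
layout = [("person", (0.1, 0.2, 0.5, 0.9)), ("dog", (0.55, 0.6, 0.9, 0.95))]
tokens = unify(layout, grid_h=2, grid_w=2)
print(len(tokens))   # 6
print(tokens[2])     # ('<patch>', (0.0, 0.0, 0.5, 0.5))
```

Because every entry carries a box in the same coordinate frame, an attention mechanism over this sequence could in principle relate image regions to objects by position. Whether and how the actual model does so is beyond what the abstract states.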