Semantic image synthesis (SIS) shows great promise for sensor simulation. However, current best practices in this field, based on GANs, have not yet reached the desired level of quality. As latent diffusion models make significant strides in image generation, we are prompted to evaluate ControlNet, a method notable for its dense control capabilities. Our investigation uncovered two primary issues with its results: the presence of spurious sub-structures within large semantic areas and the misalignment of content with the semantic mask. Through empirical study, we traced the cause of these problems to a mismatch between the noised training data distribution and the standard normal prior applied at the inference stage. To address this challenge, we developed specific noise priors for SIS, encompassing spatial, categorical, and a novel spatial-categorical joint prior for inference. This approach, which we have named SCP-Diff, sets new state-of-the-art results in SIS on Cityscapes, ADE20K and COCO-Stuff, yielding an FID as low as 10.53 on Cityscapes. The code and models can be accessed via the project page.
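The idea of replacing the standard normal prior with a data-dependent noise prior can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the Gaussian form of the priors, the latent shapes, and the value of `alpha_bar_T` are all assumptions made for illustration. The sketch estimates per-position (spatial) and per-class (categorical) latent statistics from training latents, then draws an initial latent by applying the usual forward-noising formula to a sample from those statistics instead of drawing directly from N(0, I).

```python
import numpy as np

def spatial_prior(latents):
    """Per-position mean and std over latents of shape (N, C, H, W)."""
    return latents.mean(axis=0), latents.std(axis=0)

def categorical_prior(latents, masks, num_classes):
    """Per-class channel mean and std.

    latents: (N, C, H, W); masks: (N, H, W) integer labels at latent resolution.
    Classes that never occur keep mean 0 and std 1.
    """
    channels = latents.shape[1]
    mean = np.zeros((num_classes, channels))
    std = np.ones((num_classes, channels))
    feats = latents.transpose(0, 2, 3, 1)  # (N, H, W, C)
    for k in range(num_classes):
        selected = feats[masks == k]  # (n_k, C)
        if len(selected) > 0:
            mean[k] = selected.mean(axis=0)
            std[k] = selected.std(axis=0)
    return mean, std

def per_pixel_stats(class_mean, class_std, mask):
    """Look up class statistics for one (H, W) mask, returning (C, H, W) maps."""
    return class_mean[mask].transpose(2, 0, 1), class_std[mask].transpose(2, 0, 1)

def sample_initial_latent(mean, std, alpha_bar_T, rng):
    """Sample z_0 ~ N(mean, std^2), then noise it to the final timestep:
    z_T = sqrt(alpha_bar_T) * z_0 + sqrt(1 - alpha_bar_T) * eps.
    """
    z0 = mean + std * rng.standard_normal(mean.shape)
    eps = rng.standard_normal(mean.shape)
    return np.sqrt(alpha_bar_T) * z0 + np.sqrt(1.0 - alpha_bar_T) * eps

# Toy usage with synthetic "training latents" (hypothetical shapes and values).
rng = np.random.default_rng(0)
train_latents = rng.standard_normal((8, 4, 6, 6)) + 2.0
train_masks = rng.integers(0, 3, size=(8, 6, 6))

sp_mean, sp_std = spatial_prior(train_latents)
cls_mean, cls_std = categorical_prior(train_latents, train_masks, num_classes=3)
px_mean, px_std = per_pixel_stats(cls_mean, cls_std, train_masks[0])

alpha_bar_T = 0.005  # illustrative value, not taken from the paper
z_T_spatial = sample_initial_latent(sp_mean, sp_std, alpha_bar_T, rng)
z_T_categorical = sample_initial_latent(px_mean, px_std, alpha_bar_T, rng)
```

Because `alpha_bar_T` is small but nonzero, the noised training latents keep a residual dependence on the data statistics, which is the kind of train/inference mismatch the abstract describes. How the spatial and categorical statistics are combined into the joint prior is not specified in the abstract and is left out of this sketch.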