Semantic image synthesis (SIS) shows promise for sensor simulation. However, the current best methods in this field, which are based on GANs, have not yet reached the desired level of quality. As latent diffusion models make significant strides in image generation, we evaluate ControlNet, a method notable for its dense control capabilities. Our investigation uncovered two primary issues with its results: spurious sub-structures within large semantic areas, and misalignment of the generated content with the semantic mask. Through empirical study, we traced both problems to a mismatch between the noised training data distribution and the standard normal prior applied at inference. To address this mismatch, we developed noise priors specific to SIS for use at inference: a spatial prior, a categorical prior, and a novel spatial-categorical joint prior. This approach, which we name SCP-Diff, achieves an FID of 10.53 on Cityscapes and 12.66 on ADE20K. The code and models can be accessed via the project page.
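To make the idea of a categorical noise prior concrete, the following is a minimal sketch and not the SCP-Diff implementation. The abstract does not specify how the priors are estimated, so the sketch assumes a per-class, per-channel Gaussian fitted to noised training latents, with masks already resized to latent resolution. The function names `estimate_class_prior` and `sample_initial_noise` are hypothetical.

```python
import numpy as np

def estimate_class_prior(latents, masks, num_classes):
    """Fit a per-class, per-channel Gaussian to noised latents (an assumption).

    latents: (N, C, H, W) float array of noised training latents.
    masks:   (N, H, W) integer class labels at latent resolution.
    Returns (mean, std), each of shape (num_classes, C).
    """
    channels = latents.shape[1]
    # Classes with too few pixels keep the standard normal N(0, 1).
    mean = np.zeros((num_classes, channels))
    std = np.ones((num_classes, channels))
    channels_last = latents.transpose(0, 2, 3, 1)  # (N, H, W, C)
    for k in range(num_classes):
        selected = masks == k
        if selected.sum() < 2:
            continue
        values = channels_last[selected]  # (M, C): all latent vectors of class k
        mean[k] = values.mean(axis=0)
        std[k] = values.std(axis=0)
    return mean, std

def sample_initial_noise(mask, mean, std, rng):
    """Draw the inference-time starting latent from the class-wise prior.

    mask: (H, W) integer class labels. Returns an array of shape (C, H, W).
    """
    eps = rng.standard_normal((mean.shape[1],) + mask.shape)
    mu = mean[mask].transpose(2, 0, 1)     # (H, W, C) -> (C, H, W)
    sigma = std[mask].transpose(2, 0, 1)
    return mu + sigma * eps
```

In this sketch, the array returned by `sample_initial_noise` would take the place of the standard normal sample that a diffusion sampler normally starts from. A spatial prior could be sketched analogously by computing the statistics per latent location instead of per class.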