Semantic image synthesis (SIS) is the task of generating realistic images that correspond to given semantic maps (labels). It has diverse real-world applications, such as photo editing and content creation. In such applications, however, SIS often encounters noisy user inputs. To address this, we propose the Stochastic Conditional Diffusion Model (SCDM), a robust conditional diffusion model with novel forward and generation processes tailored to SIS with noisy labels. SCDM enhances robustness by stochastically perturbing the semantic label maps through Label Diffusion, which diffuses the labels with a discrete diffusion process. Under this diffusion, the noisy and clean semantic maps become increasingly similar as the timestep increases and eventually become identical at $t=T$. This facilitates the generation of an image close to a clean image, enabling robust generation. Furthermore, we propose a class-wise noise schedule that diffuses the labels differently depending on their class. Extensive experiments and analyses on benchmark datasets, including a novel experimental setup that simulates human errors in real-world applications, demonstrate that the proposed method generates high-quality samples.
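To make the claim that noisy and clean maps coincide at $t=T$ concrete, the following is a minimal sketch and not the authors' implementation. It assumes an absorbing-state form of discrete diffusion in which each pixel's label is independently replaced by a mask label, and a polynomial class-wise schedule. The names `MASK`, `keep_prob`, `diffuse_labels`, and `class_rates` are hypothetical and do not come from the abstract.

```python
import numpy as np

MASK = -1  # hypothetical absorbing "masked" label (an assumption of this sketch)


def keep_prob(t, T, rate):
    """Probability that a pixel still holds its original label at step t.

    `rate` > 0 is a hypothetical per-class exponent. The value is 1 at t = 0
    and 0 at t = T, so every label is absorbed by the final step.
    """
    return (1.0 - t / T) ** rate


def diffuse_labels(label_map, t, T, class_rates, rng):
    """Replace each pixel's label by MASK with probability 1 - keep_prob."""
    rates = class_rates[label_map]  # per-pixel rate looked up from its class
    keep = rng.random(label_map.shape) < keep_prob(t, T, rates)
    return np.where(keep, label_map, MASK)


# Toy example: a clean 4x4 map with 3 classes and a copy with two wrong labels.
rng = np.random.default_rng(0)
T = 100
class_rates = np.array([0.5, 1.0, 2.0])  # assumed values, one per class
clean = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [2, 2, 1, 1],
                  [2, 2, 2, 2]])
noisy = clean.copy()
noisy[0, 0] = 1  # simulated user error
noisy[3, 3] = 0  # simulated user error

for t in (0, T // 2, T):
    a = diffuse_labels(clean, t, T, class_rates, rng)
    b = diffuse_labels(noisy, t, T, class_rates, rng)
    print(f"t={t:3d}  fraction of matching pixels: {np.mean(a == b):.2f}")
```

In this sketch the two maps differ at $t=0$ only at the two corrupted pixels, and at $t=T$ both are fully masked and therefore identical. A larger exponent in `class_rates` makes the corresponding class lose its labels earlier, which is one simple way to realize a class-dependent schedule.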