Denoising diffusion models have shown remarkable capabilities in generating realistic, high-quality, and diverse images. However, the extent of controllability during generation is underexplored. Inspired by GAN latent-space techniques for image manipulation, we train a diffusion model conditioned on two latent codes: a spatial content mask and a flattened style embedding. We rely on the inductive bias of the progressive denoising process of diffusion models to encode pose/layout information in the spatial content mask and semantic/style information in the style code. We propose two generic sampling techniques for improving controllability. First, we extend composable diffusion models to allow for some dependence between conditional inputs, which improves the quality of generations while also providing control over the amount of guidance from each latent code and from their joint distribution. Second, we propose timestep-dependent weight scheduling for the content and style latents to further improve the translations. We observe better controllability compared to existing methods and show that, without explicit training objectives, diffusion models can be used for effective image manipulation and image translation.
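The two sampling techniques described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the guidance formula, so the combination rule, the sigmoid schedule, and every name here (`sigmoid_schedule`, `combine_guidance`, `scale`, `joint_weight`, `sharpness`, `midpoint`) are assumptions chosen for illustration. The sketch assumes a classifier-free-guidance-style setup in which the model yields four noise predictions per step (unconditional, content-only, style-only, and joint), and it assumes that content guidance should dominate at noisy, early steps while style guidance dominates later.

```python
import math


def sigmoid_schedule(t, num_steps, sharpness=10.0, midpoint=0.5):
    """Hypothetical schedule: a weight in (0, 1) that grows with the noise level.

    t = num_steps is the noisiest step, t = 0 the cleanest.
    """
    x = t / num_steps
    return 1.0 / (1.0 + math.exp(-sharpness * (x - midpoint)))


def combine_guidance(eps_uncond, eps_content, eps_style, eps_joint,
                     t, num_steps, scale=3.0, joint_weight=0.5):
    """Blend four noise predictions into one guided prediction (illustrative).

    joint_weight = 1 uses only the joint (content, style) term;
    joint_weight = 0 treats the two conditions as independent.
    """
    # Assumption: content (layout) matters most early, style matters most late.
    w_content = sigmoid_schedule(t, num_steps)
    w_style = 1.0 - w_content

    combined = []
    for u, c, s, j in zip(eps_uncond, eps_content, eps_style, eps_joint):
        # Guidance direction if content and style were independent conditions.
        independent = w_content * (c - u) + w_style * (s - u)
        # Guidance direction from conditioning on both codes together.
        joint = j - u
        direction = joint_weight * joint + (1.0 - joint_weight) * independent
        combined.append(u + scale * direction)
    return combined


# Toy usage: each prediction is a flat list standing in for a noise tensor.
guided = combine_guidance([0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0],
                          t=800, num_steps=1000)
```

In this sketch, `joint_weight` interpolates between fully independent conditions and the fully joint condition, while the schedule shifts emphasis from content to style as denoising proceeds. Both are knobs one would tune empirically.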