Late-Constraint Diffusion Guidance for Controllable Image Synthesis

Diffusion models, with or without text conditioning, have demonstrated an impressive capability to synthesize photorealistic images from a few words or even none. These models may not fully satisfy user needs, however, as ordinary users and artists often want to control the synthesized images with specific guidance, such as overall layout, color, structure, and object shape. To adapt diffusion models for controllable image synthesis, several methods have been proposed that incorporate the required conditions as regularization on the intermediate features of the diffusion denoising network. These methods, referred to as early-constraint methods in this paper, have difficulty handling multiple conditions with a single solution. They tend to train a separate model for each specific condition, which incurs a high training cost and yields solutions that do not generalize. To address these difficulties, we propose a new approach, namely late-constraint: we leave the diffusion network unchanged but constrain its output to align with the required conditions. Specifically, we train a lightweight condition adapter to establish the correlation between external conditions and the internal representations of diffusion models. During the iterative denoising process, the conditional guidance is fed into the corresponding condition adapter, which steers the sampling process using the established correlation. We further equip the late-constraint strategy with a timestep resampling method and an early stopping technique, which improve the quality of the synthesized images while still complying with the guidance. Our method outperforms existing early-constraint methods and generalizes better to unseen conditions.
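
The late-constraint idea described above can be illustrated with a toy sketch. This is not the paper's implementation: the "denoiser" is a hypothetical stand-in that pulls a small vector toward a fixed mean, the "adapter" is a hypothetical function that predicts a scalar condition (the mean of the state) rather than a trained network reading internal features, and the names `frozen_denoise_step`, `adapter`, `guided_sample`, and `stop_guidance_at` are assumptions made only for this example. It shows the general pattern of leaving the denoiser untouched, nudging each intermediate state along the gradient of an adapter-based loss, and switching guidance off for the final steps. Timestep resampling is not modeled.

```python
from typing import List

Vector = List[float]


def frozen_denoise_step(x: Vector, prior_mean: Vector, rate: float = 0.2) -> Vector:
    """Toy stand-in for one step of a pretrained denoiser that is never modified."""
    return [xi + rate * (mi - xi) for xi, mi in zip(x, prior_mean)]


def adapter(x: Vector) -> float:
    """Toy stand-in for a condition adapter: predicts a scalar condition from the state."""
    return sum(x) / len(x)


def adapter_loss_grad(x: Vector, target: float) -> Vector:
    """Gradient of (adapter(x) - target) ** 2 with respect to each entry of x."""
    n = len(x)
    g = 2.0 * (adapter(x) - target) / n
    return [g] * n


def guided_sample(
    x0: Vector,
    prior_mean: Vector,
    target: float,
    steps: int = 20,
    guidance_scale: float = 1.0,
    stop_guidance_at: int = 18,  # assumed early-stopping point, chosen arbitrarily
) -> Vector:
    """Run the frozen denoiser and apply adapter guidance to its output."""
    x = list(x0)
    for step in range(steps):
        # The denoiser itself is unchanged; only its output is constrained.
        x = frozen_denoise_step(x, prior_mean)
        if step < stop_guidance_at:
            grad = adapter_loss_grad(x, target)
            x = [xi - guidance_scale * gi for xi, gi in zip(x, grad)]
    return x


if __name__ == "__main__":
    start = [2.0, -1.0, 0.5, -0.5]
    prior = [0.0, 0.0, 0.0, 0.0]
    guided = guided_sample(start, prior, target=1.0)
    unguided = guided_sample(start, prior, target=1.0, guidance_scale=0.0)
    print("guided condition:  ", adapter(guided))
    print("unguided condition:", adapter(unguided))
```

In this toy setting the guided run ends with a predicted condition closer to the target than the unguided run, even though the denoiser is identical in both. In a real system the gradient would presumably be obtained by automatic differentiation through a trained adapter rather than in closed form.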
