In the rapidly advancing field of visual generation, diffusion models have marked a significant shift in capability, most notably through text-guided generation. However, text alone cannot express the varied and complex requirements of different applications and scenarios. To address this shortfall, a variety of studies aim to control pre-trained text-to-image (T2I) models so that they support novel conditions. In this survey, we review the literature on controllable generation with T2I diffusion models, covering both the theoretical foundations and practical advances in this domain. Our review begins with a brief introduction to denoising diffusion probabilistic models (DDPMs) and widely used T2I diffusion models. We then examine the control mechanisms of diffusion models, analyzing theoretically how novel conditions are introduced into the denoising process for conditional generation. Finally, we give a detailed overview of research in this area, organized into three categories from the condition perspective: generation with specific conditions, generation with multiple conditions, and universal controllable generation. For an exhaustive list of the controllable generation literature surveyed, please refer to our curated repository at \url{https://github.com/PRIV-Creation/Awesome-Controllable-T2I-Diffusion-Models}.