Diffusion probabilistic models have achieved enormous success in image generation and manipulation. In this paper, we explore a novel paradigm that applies diffusion models and classifier guidance in the latent semantic space for compositional visual tasks. Specifically, we train latent diffusion models and auxiliary latent classifiers to facilitate non-linear navigation of latent representation generation for any pre-trained generative model with a semantic latent space. We demonstrate that the conditional generation achieved by latent classifier guidance provably maximizes a lower bound of the conditional log probability during training. To maintain the original semantics during manipulation, we introduce a new guidance term, which we show is crucial for achieving compositionality. Under additional assumptions, we show that the non-linear manipulation reduces to a simple latent arithmetic approach. We show that this paradigm based on latent classifier guidance is agnostic to the pre-trained generative model, and we present competitive results for both image generation and sequential manipulation of real and synthetic images. Our findings suggest that latent classifier guidance is a promising approach that merits further exploration, even in the presence of other strong competing methods.