Denoising diffusion models have shown remarkable performance in generating diverse, high-quality images from text. Many techniques have been proposed that build on or align with models such as Stable Diffusion and Imagen, which generate images directly from text. A less explored approach is DALL-E 2's two-step process, which comprises a Diffusion Prior that generates a CLIP image embedding from text and a Diffusion Decoder that generates an image from a CLIP image embedding. We explore the capabilities of the Diffusion Prior and the advantages of an intermediate CLIP representation. We observe that the Diffusion Prior can be used in a memory- and compute-efficient way to constrain generation to a specific domain without altering the larger Diffusion Decoder. Moreover, we show that the Diffusion Prior can be trained with additional conditional information, such as a color histogram, to further control generation. We show quantitatively and qualitatively that the proposed approaches outperform prompt engineering for domain-specific generation and existing baselines for color-conditioned generation. We believe that our observations and results will encourage further research into the Diffusion Prior and uncover more of its capabilities.
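To make the color-histogram conditioning signal concrete, the following is a minimal sketch of how such a vector could be computed from an image's pixels. The function name `color_histogram`, the joint RGB layout, and the choice of 4 bins per channel are illustrative assumptions and are not taken from the abstract. How the resulting vector is fed to the Diffusion Prior is not shown here.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def color_histogram(pixels: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Return a normalized joint RGB histogram as a flat list.

    The bin count and the joint-RGB layout are illustrative assumptions,
    not a configuration stated in the abstract.
    """
    if not pixels:
        raise ValueError("need at least one pixel")
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index in 0..bins_per_channel-1.
        ri = r * bins_per_channel // 256
        gi = g * bins_per_channel // 256
        bi = b * bins_per_channel // 256
        hist[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1.0
    total = float(len(pixels))
    # Normalize so the entries sum to 1, independent of image size.
    return [count / total for count in hist]


# Toy "image": three pure-red pixels and one pure-blue pixel.
toy_pixels = [(255, 0, 0)] * 3 + [(0, 0, 255)]
cond = color_histogram(toy_pixels)
```

Normalizing by the pixel count keeps the vector comparable across images of different resolutions, which is one plausible reason to prefer it over raw counts as a conditioning input.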