Denoising diffusion models have shown remarkable performance in generating diverse, high-quality images from text. Numerous techniques have been proposed that build on or align with models such as Stable Diffusion and Imagen, which generate images directly from text. A less explored approach is DALLE-2's two-step process, comprising a Diffusion Prior that generates a CLIP image embedding from text and a Diffusion Decoder that generates an image from a CLIP image embedding. We explore the capabilities of the Diffusion Prior and the advantages of an intermediate CLIP representation. We observe that the Diffusion Prior can be used in a memory- and compute-efficient way to constrain generation to a specific domain without altering the larger Diffusion Decoder. Moreover, we show that the Diffusion Prior can be trained with additional conditioning information, such as a color histogram, to further control generation. We show quantitatively and qualitatively that the proposed approaches perform better than prompt engineering for domain-specific generation and better than existing baselines for color-conditioned generation. We believe that our observations and results will encourage further research into the Diffusion Prior and uncover more of its capabilities.
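To make the color-histogram conditioning signal concrete, the following is a minimal sketch of how such a histogram could be computed from an image's pixels. The abstract does not specify the binning scheme or color space, so the joint RGB histogram, the function name `color_histogram`, and the default of 4 bins per channel are all illustrative assumptions rather than the method used in this work.

```python
from typing import Iterable, List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def color_histogram(pixels: Iterable[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Return a flattened, normalized joint RGB histogram.

    Assumption: a uniform joint RGB binning; the binning actually used
    for conditioning is not stated in the abstract.
    """
    if not 1 <= bins_per_channel <= 256:
        raise ValueError("bins_per_channel must be in 1..256")

    counts = [0] * (bins_per_channel ** 3)
    total = 0
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index in 0..bins_per_channel-1.
        ri = r * bins_per_channel // 256
        gi = g * bins_per_channel // 256
        bi = b * bins_per_channel // 256
        # Flatten the 3-D bin index into a single position.
        counts[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1
        total += 1

    if total == 0:
        # No pixels: return an all-zero vector instead of dividing by zero.
        return [0.0] * len(counts)
    return [c / total for c in counts]
```

A vector like this has a fixed length (64 entries with the default setting) regardless of image size, which is what makes it convenient as an extra conditioning input alongside the text.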