Text-to-image diffusion models such as Stable Diffusion have recently attracted considerable research interest, and inverting the diffusion process can play an important role in better understanding the generative process and in engineering prompts that yield the desired images. To this end, we introduce the new task of predicting the text prompt given an image generated by a generative diffusion model. We combine a series of white-box and black-box models (with and without access to the weights of the diffusion network, respectively) to address the proposed task. We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective that generates improved prompts. To further improve our method, we employ a curriculum learning procedure that promotes the learning of image-prompt pairs with lower labeling noise (i.e., pairs that are better aligned), and an unsupervised domain-adaptive kernel learning method that uses the similarities between samples in the source and target domains as extra features. We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion. Our novel learning framework produces excellent results on this task, yielding the highest gains when applied to the white-box model. In addition, we make an interesting discovery: training a diffusion model on the prompt generation task can make the model generate images that are much better aligned with the input prompts when the model is directly reused for text-to-image generation.
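The joint objective mentioned in the abstract, prompt regression combined with multi-label vocabulary classification, can be illustrated with a minimal sketch. The abstract does not specify the loss functions, so everything below is an assumption made for illustration: the regression term is taken to be a cosine distance between a predicted and a target prompt embedding, the classification term is a per-word binary cross-entropy over a fixed vocabulary, and the names `cosine_loss`, `multilabel_bce`, `joint_loss`, and the `weight` parameter are hypothetical rather than the authors' implementation.

```python
import math


def cosine_loss(pred, target):
    """Assumed regression term: 1 - cosine similarity of two embeddings."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_pred = math.sqrt(sum(p * p for p in pred))
    norm_target = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm_pred * norm_target)


def multilabel_bce(logits, labels):
    """Assumed classification term: mean binary cross-entropy, one sigmoid per vocabulary word."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def joint_loss(pred_emb, target_emb, vocab_logits, vocab_labels, weight=1.0):
    """Sum of both terms; `weight` is a hypothetical balancing factor."""
    regression = cosine_loss(pred_emb, target_emb)
    classification = multilabel_bce(vocab_logits, vocab_labels)
    return regression + weight * classification


if __name__ == "__main__":
    # Toy example: a 3-dimensional embedding and a 4-word vocabulary.
    loss = joint_loss(
        pred_emb=[0.9, 0.1, 0.0],
        target_emb=[1.0, 0.0, 0.0],
        vocab_logits=[2.0, -1.5, 0.3, -3.0],
        vocab_labels=[1, 0, 1, 0],
    )
    print(f"joint loss: {loss:.4f}")
```

In this sketch the classification labels mark which vocabulary words occur in the ground-truth prompt, so the second term acts as an auxiliary signal alongside the embedding regression. A curriculum scheme like the one described in the abstract could, under the same assumptions, reweight or order samples by how well each image-prompt pair is aligned.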