Stable Diffusion is a prominent text-to-image generation model whose input is a text prompt encoded by the Contrastive Language-Image Pre-Training (CLIP) model. Text prompts, however, are limited in their ability to convey the implicit information carried by a reference image. Existing methods address this limitation through expensive training procedures that involve millions of training samples for image-to-image generation. In contrast, this paper demonstrates that the CLIP model, as used in Stable Diffusion, inherently possesses the ability to convert images into text prompts instantaneously. This image-to-prompt conversion is achieved with a linear projection matrix that is calculated in closed form. The paper further shows that this capability can be enhanced either by using a small amount of similar-domain training data (approximately 100 images) or by performing several online training steps (around 30 iterations) on the reference images. Together, these approaches offer a simple and flexible way to bridge the gap between images and text prompts. The method can be applied to tasks such as image variation and image editing, enabling more effective and seamless interaction between images and text prompts.
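To make the phrase "a linear projection matrix that is calculated in closed form" concrete, the following is a minimal sketch of one way such a matrix could be obtained: taking the Moore-Penrose pseudo-inverse of a text projection matrix and applying it to an image embedding that lives in the shared image-text space. The abstract does not spell out the exact construction, so this is an illustrative assumption rather than the paper's procedure. The matrix `text_proj`, the dimensions, and the random embedding are hypothetical stand-ins; no real CLIP weights are loaded.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not CLIP's real sizes).
d_text, d_joint = 16, 8

# Hypothetical stand-in for a text projection matrix that maps a
# text-encoder feature (d_text) into the joint image-text space (d_joint).
text_proj = rng.standard_normal((d_joint, d_text))

# Closed-form step: the pseudo-inverse is computed directly,
# with no iterative training involved.
inverse_proj = np.linalg.pinv(text_proj)  # shape (d_text, d_joint)

# Hypothetical stand-in for a normalized image embedding in the joint space.
image_embedding = rng.standard_normal(d_joint)
image_embedding /= np.linalg.norm(image_embedding)

# Map the image embedding back into the text-feature space, giving a
# "pseudo text feature" that could play the role of an encoded prompt.
pseudo_text_feature = inverse_proj @ image_embedding

# Sanity check: projecting the pseudo text feature forward again
# recovers the original image embedding (text_proj has full row rank here).
reprojected = text_proj @ pseudo_text_feature
print(np.allclose(reprojected, image_embedding))
```

The pseudo-inverse is used here because it yields the minimum-norm solution in a single linear-algebra call, which matches the "no training" spirit of the claim. The refinements mentioned in the abstract (about 100 similar-domain images, or about 30 online iterations) would go beyond this fixed matrix and are not sketched here.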