Stable Diffusion is a prominent text-to-image generation model that takes a text prompt as input, encoded by the Contrastive Language-Image Pre-Training (CLIP) model. However, text prompts are limited in their ability to convey the implicit information contained in reference images. Existing methods address this limitation for image-to-image generation through expensive training procedures involving millions of training samples. In contrast, this paper demonstrates that the CLIP model, as used in Stable Diffusion, inherently possesses the ability to convert images into text prompts instantaneously. This image-to-prompt conversion is achieved with a linear projection matrix that is calculated in closed form. Moreover, the paper shows that this capability can be further enhanced either by using a small amount of similar-domain training data (approximately 100 images) or by running a few online training steps (around 30 iterations) on the reference images. Together, these approaches offer a simple and flexible way to bridge the gap between images and text prompts. The method can be applied to tasks such as image variation and image editing, enabling more effective and seamless interaction between images and text prompts.
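To make the idea of a linear projection matrix "calculated in closed form" concrete, the following is a minimal sketch and not the paper's actual derivation, which the abstract does not spell out. It assumes the projection is fit by ridge-regularized least squares between paired embeddings. The function name `fit_linear_projection`, the `ridge` parameter, the embedding dimensions, and the synthetic random data standing in for CLIP image and text embeddings are all hypothetical.

```python
import numpy as np


def fit_linear_projection(image_emb, text_emb, ridge=1e-6):
    """Closed-form ridge least squares (an assumed stand-in, not the paper's formula).

    Returns W minimizing ||image_emb @ W - text_emb||^2 + ridge * ||W||^2,
    i.e. W = (X^T X + ridge * I)^{-1} X^T Y.
    """
    d = image_emb.shape[1]
    gram = image_emb.T @ image_emb + ridge * np.eye(d)
    # Solve the normal equations directly instead of forming an explicit inverse.
    return np.linalg.solve(gram, image_emb.T @ text_emb)


# Synthetic stand-ins for paired embeddings (hypothetical sizes, not CLIP's).
rng = np.random.default_rng(0)
n, d_img, d_txt = 200, 16, 12
true_W = rng.normal(size=(d_img, d_txt))
image_emb = rng.normal(size=(n, d_img))
text_emb = image_emb @ true_W  # noise-free linear relation, for illustration only

W = fit_linear_projection(image_emb, text_emb)

# Project a new "image embedding" into the "prompt embedding" space.
new_image = rng.normal(size=(1, d_img))
pseudo_prompt = new_image @ W

# Largest entrywise deviation of the fitted matrix from the generating one.
max_err = float(np.abs(W - true_W).max())
```

No iterative optimization is involved: the matrix comes from a single linear solve, which is what makes such a conversion effectively instantaneous once the paired embeddings are available. In a real pipeline the two arrays would come from a CLIP image encoder and text encoder rather than from a random generator.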