The CLIP Model is Secretly an Image-to-Prompt Converter

The Stable Diffusion model is a prominent text-to-image generation model that relies on a text prompt as its input, which is encoded using the Contrastive Language-Image Pre-Training (CLIP). However, text prompts have limitations when it comes to incorporating implicit information from reference images. Existing methods have attempted to address this limitation by employing expensive training procedures involving millions of training samples for image-to-image generation. In contrast, this paper demonstrates that the CLIP model, as utilized in Stable Diffusion, inherently possesses the ability to instantaneously convert images into text prompts. Such an image-to-prompt conversion can be achieved by utilizing a linear projection matrix that is calculated in a closed form. Moreover, the paper showcases that this capability can be further enhanced by either utilizing a small amount of similar-domain training data (approximately 100 images) or incorporating several online training steps (around 30 iterations) on the reference images. By leveraging these approaches, the proposed method offers a simple and flexible solution to bridge the gap between images and text prompts. This methodology can be applied to various tasks such as image variation and image editing, facilitating more effective and seamless interaction between images and textual prompts.

翻译：稳定扩散模型是一种以文本提示词为输入的文本到图像生成模型，其文本提示词通过对比语言-图像预训练（CLIP）模型进行编码。然而，文本提示词在融入参考图像的隐含信息方面存在局限性。现有方法试图通过采用涉及数百万训练样本的昂贵训练过程来解决图像到图像生成中的这一局限。相比之下，本文证明稳定扩散中使用的CLIP模型本质上具备将图像即时转换为文本提示词的能力。这种图像到提示词的转换可通过使用以闭合形式计算的线性投影矩阵来实现。此外，本文展示该能力可通过利用少量同域训练数据（约100张图像）或在参考图像上执行若干在线训练步骤（约30次迭代）进一步增强。通过采用这些方法，所提出的方案为弥合图像与文本提示词之间的差距提供了简单而灵活的解决方案。该方法可应用于图像变体生成与图像编辑等多种任务，从而促进图像与文本提示词之间更有效且无缝的交互。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

不可错过！《机器学习100讲》课程，UBC Mark Schmidt讲授

专知会员服务

76+阅读 · 2022年6月28日

50+篇《神经架构搜索NAS》2020论文合集

专知会员服务

61+阅读 · 2020年3月19日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日