Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models

Text-to-image (T2I) research has grown explosively in the past year, owing to large-scale pre-trained diffusion models and many emerging personalization and editing approaches. Yet one pain point persists: text prompt engineering. Searching for high-quality text prompts that yield customized results is more art than science. Moreover, as is commonly argued, "an image is worth a thousand words": an attempt to describe a desired image in text often ends up ambiguous and cannot comprehensively cover delicate visual details, which necessitates additional controls from the visual domain. In this paper, we take a bold step forward: taking "Text" out of a pre-trained T2I diffusion model, to reduce the burdensome prompt engineering effort for users. Our proposed framework, Prompt-Free Diffusion, relies only on visual inputs to generate new images: it takes a reference image as "context", an optional image structural conditioning, and an initial noise, with absolutely no text prompt. The core architecture behind the scenes is the Semantic Context Encoder (SeeCoder), which substitutes for the commonly used CLIP-based or LLM-based text encoder. The reusability of SeeCoder also makes it a convenient drop-in component: one can pre-train a SeeCoder in one T2I model and reuse it in another. Through extensive experiments, Prompt-Free Diffusion is found to (i) outperform prior exemplar-based image synthesis approaches; (ii) perform on par with state-of-the-art T2I models using prompts that follow best practice; and (iii) be naturally extensible to other downstream applications such as anime figure generation and virtual try-on, with promising quality. Our code and models are open-sourced at https://github.com/SHI-Labs/Prompt-Free-Diffusion.
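
The "drop-in" idea in the abstract can be illustrated with a toy sketch. It assumes only what the abstract states: the sampler consumes a context, an optional structural conditioning, and an initial noise, and the context may come from an image encoder instead of a text encoder. Everything named here (`ToyTextEncoder`, `ToyImageEncoder`, `generate`, the averaging update) is hypothetical and is not the SeeCoder architecture or the authors' code; it only shows that two encoders producing the same output shape are interchangeable from the sampler's point of view.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence

Vector = List[float]


@dataclass
class ToyTextEncoder:
    """Hypothetical stand-in for a text encoder: one vector per word."""
    dim: int = 4

    def encode(self, prompt: str) -> List[Vector]:
        return [[len(word) / 10.0] * self.dim for word in prompt.split()]


@dataclass
class ToyImageEncoder:
    """Hypothetical stand-in for an image context encoder: one vector per row."""
    dim: int = 4

    def encode(self, image: Sequence[Sequence[float]]) -> List[Vector]:
        return [[sum(row) / len(row)] * self.dim for row in image]


def generate(context: List[Vector], noise: Vector,
             structure: Optional[Vector] = None, steps: int = 3) -> Vector:
    """Toy sampler (an assumption, not a diffusion model): each step moves
    the latent halfway toward the mean context vector, then adds the
    optional structural conditioning."""
    dim = len(noise)
    mean_ctx = [sum(v[i] for v in context) / len(context) for i in range(dim)]
    latent = list(noise)
    for _ in range(steps):
        latent = [0.5 * z + 0.5 * c for z, c in zip(latent, mean_ctx)]
        if structure is not None:
            latent = [z + s for z, s in zip(latent, structure)]
    return latent


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]   # initial noise
reference_image = [[0.2, 0.4], [0.6, 0.8]]        # tiny 2x2 "image"

# The sampler is unchanged; only the source of the context differs.
text_context = ToyTextEncoder().encode("a red fox")
image_context = ToyImageEncoder().encode(reference_image)
from_text = generate(text_context, noise)
from_image = generate(image_context, noise)
```

Because `generate` depends only on the shape of the context, swapping the text encoder for the image encoder requires no change to the sampler, which is the sense in which the abstract calls the encoder reusable across T2I models.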

