Text-to-image (T2I) research has grown explosively in the past year, owing to large-scale pre-trained diffusion models and many emerging personalization and editing approaches. Yet one pain point persists: text prompt engineering. Searching for high-quality text prompts that yield customized results is more art than science. Moreover, as is commonly argued, "an image is worth a thousand words": an attempt to describe a desired image in text often ends up ambiguous and cannot comprehensively cover delicate visual details, hence necessitating additional controls from the visual domain. In this paper, we take a bold step forward: taking "Text" out of a pre-trained T2I diffusion model to reduce the burdensome prompt engineering effort for users. Our proposed framework, Prompt-Free Diffusion, relies only on visual inputs to generate new images: it takes a reference image as "context", an optional image structural conditioning, and an initial noise, with absolutely no text prompt. The core architecture behind the scenes is the Semantic Context Encoder (SeeCoder), which substitutes for the commonly used CLIP-based or LLM-based text encoder. The reusability of SeeCoder also makes it a convenient drop-in component: one can pre-train a SeeCoder in one T2I model and reuse it for another. Through extensive experiments, Prompt-Free Diffusion is found to (i) outperform prior exemplar-based image synthesis approaches; (ii) perform on par with state-of-the-art T2I models using prompts that follow best practice; and (iii) be naturally extensible to other downstream applications such as anime figure generation and virtual try-on, with promising quality. Our code and models are open-sourced at https://github.com/SHI-Labs/Prompt-Free-Diffusion.
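To make the input interface described in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes only what the abstract states: generation is conditioned on a reference image ("context"), an optional structural conditioning, and an initial noise, with no text prompt. Every name here (`toy_image_context_encoder`, `cross_attention`, `generate`, `EMBED_DIM`) is hypothetical, the encoder is a hand-written stand-in rather than SeeCoder, and the update rule is a placeholder for a real denoising network.

```python
import math
import random
from typing import List, Optional

Vector = List[float]
Image = List[List[float]]  # toy grayscale image: rows of pixel values in [0, 1]

EMBED_DIM = 4  # hypothetical width of each context token


def toy_image_context_encoder(image: Image, num_tokens: int = 2) -> List[Vector]:
    """Hypothetical stand-in for an image context encoder (NOT SeeCoder).

    Each token summarises one horizontal band of the image with four
    statistics: mean, variance, minimum, and maximum pixel value.
    """
    rows_per_band = max(1, len(image) // num_tokens)
    tokens = []
    for t in range(num_tokens):
        # Fall back to the last row if the image has fewer rows than tokens.
        band = image[t * rows_per_band:(t + 1) * rows_per_band] or image[-1:]
        pixels = [p for row in band for p in row]
        mean = sum(pixels) / len(pixels)
        var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
        tokens.append([mean, var, min(pixels), max(pixels)])
    return tokens


def cross_attention(query: Vector, context: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over the context tokens.

    Simplification: tokens act as both keys and values (no learned projections).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, tok)) / scale for tok in context]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * tok[i] for w, tok in zip(weights, context))
        for i in range(len(query))
    ]


def generate(
    reference_image: Image,
    noise: List[Vector],
    structure: Optional[List[Vector]] = None,
    steps: int = 4,
) -> List[Vector]:
    """Toy 'prompt-free' loop: the only conditioning signals are visual."""
    context = toy_image_context_encoder(reference_image)  # replaces a text encoder
    latent = [list(v) for v in noise]  # copy so the caller's noise is untouched
    for _ in range(steps):
        for i, vec in enumerate(latent):
            attended = cross_attention(vec, context)
            # Placeholder update: move each latent halfway toward its attended context.
            latent[i] = [x + 0.5 * (a - x) for x, a in zip(vec, attended)]
            if structure is not None:
                # Optional structural conditioning added as a small residual.
                latent[i] = [x + 0.1 * s for x, s in zip(latent[i], structure[i])]
    return latent


rng = random.Random(0)
noise = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(3)]
reference = [[0.1, 0.2], [0.8, 0.9]]
sample = generate(reference, noise)
```

The point of the sketch is the interface, not the numerics: the denoising loop consumes a list of context tokens, so any encoder that emits tokens of the expected width could in principle be swapped in, which is the drop-in property the abstract attributes to SeeCoder.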