Text-to-image diffusion models have achieved remarkable performance in image synthesis, but the text interface does not always provide fine-grained control over individual image factors. For instance, changing a single token in the text can have unintended effects on the image. This paper shows that a simple modification of classifier-free guidance can help disentangle image factors in text-to-image models. The key idea of our method, Contrastive Guidance, is to characterize an intended factor with two prompts that differ in as few tokens as possible: the positive prompt describes the image to be synthesized, and the baseline prompt serves as a "baseline" that disentangles other factors. Contrastive Guidance is a general method whose benefits we illustrate in three scenarios: (1) to guide domain-specific diffusion models trained on an object class, (2) to gain continuous, rig-like controls for text-to-image generation, and (3) to improve the performance of zero-shot image editors.
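To make the two-prompt idea concrete, the following is a minimal sketch of how a contrastive term could replace the usual conditional-minus-unconditional term in classifier-free guidance. The abstract does not state the exact update rule, so the combination used here (unconditional prediction plus a scaled difference between the positive-prompt and baseline-prompt predictions) is an assumption for illustration only. The function names, the `scale` parameter, and the random arrays standing in for a denoiser's noise predictions are all hypothetical.

```python
import numpy as np


def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance on noise predictions."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def contrastive_guidance(eps_uncond, eps_pos, eps_base, scale):
    """Assumed contrastive variant (not taken from the paper's equations).

    The guidance direction is the difference between the prediction for the
    positive prompt and the prediction for the baseline prompt, so factors
    shared by both prompts cancel out of the guidance term.
    """
    return eps_uncond + scale * (eps_pos - eps_base)


# Stand-ins for a denoiser's outputs at one sampling step (hypothetical data).
rng = np.random.default_rng(0)
shape = (4, 8, 8)
eps_uncond = rng.standard_normal(shape)  # empty prompt
eps_pos = rng.standard_normal(shape)     # positive prompt
eps_base = rng.standard_normal(shape)    # baseline prompt, minimally different

guided = contrastive_guidance(eps_uncond, eps_pos, eps_base, scale=7.5)
```

Under this assumed form, using the unconditional prediction as the baseline recovers ordinary classifier-free guidance, and identical positive and baseline prompts leave the unconditional prediction unchanged, which matches the intuition that only the differing tokens should steer the sample.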