Large, text-conditioned generative diffusion models have recently attracted considerable attention for their ability to generate high-fidelity images from text alone. However, achieving high-quality results in a single attempt is nearly infeasible. Instead, text-guided image generation typically involves the user making many slight changes to the inputs to iteratively carve out the envisioned image. Yet slight changes to the input prompt often lead to entirely different images, which limits the granularity of the artist's control. To provide this flexibility, we present the Stable Artist, an image editing approach that enables fine-grained control of the image generation process. Its main component is semantic guidance (SEGA), which steers the diffusion process along a variable number of semantic directions. This allows for subtle edits to images, changes in composition and style, and optimization of the overall artistic conception. Furthermore, SEGA enables probing of latent spaces to gain insights into how the model represents learned concepts, even complex ones such as 'carbon emission'. We demonstrate the Stable Artist on several tasks, showcasing high-quality image editing and composition.
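The abstract does not state how SEGA combines its semantic directions, so the following is only an illustrative sketch under assumptions: it takes the standard classifier-free guidance update and adds one extra term per concept, each with its own signed scale. The function name `guided_noise`, its parameters, and the use of plain lists in place of latent tensors are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def guided_noise(
    eps_uncond: Sequence[float],
    eps_prompt: Sequence[float],
    concepts: Sequence[Tuple[Sequence[float], float]],
    guidance_scale: float = 7.5,
) -> List[float]:
    """Combine noise estimates for one denoising step (assumed formulation).

    eps_uncond: noise estimate without any text conditioning.
    eps_prompt: noise estimate conditioned on the main prompt.
    concepts:   (eps_concept, scale) pairs, one per semantic direction;
                a positive scale pushes toward the concept, a negative one away.
    """
    combined = []
    for k, (u, p) in enumerate(zip(eps_uncond, eps_prompt)):
        # Standard classifier-free guidance term for the main prompt.
        value = u + guidance_scale * (p - u)
        # Assumed extension: one additive direction per concept.
        for eps_concept, scale in concepts:
            value += scale * (eps_concept[k] - u)
        combined.append(value)
    return combined
```

With an empty `concepts` list the function reduces to ordinary classifier-free guidance, which is why adding or removing a direction in this sketch leaves the prompt's own contribution unchanged.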