Personalizing generative models offers a way to guide image generation with user-provided references. Current personalization methods can invert an object or concept into the textual conditioning space and compose it into new natural-language sentences for text-to-image diffusion models. However, representing and editing specific visual attributes such as material, style, and layout remains a challenge, and existing representations lack disentanglement and editability. To address this problem, we propose an approach that leverages the step-by-step generation process of diffusion models, which generate images from low-frequency to high-frequency information, and which therefore offers a new perspective on representing, generating, and editing images. We develop the Prompt Spectrum Space P*, an expanded textual conditioning space, and a new image representation method called ProSpect. ProSpect represents an image as a collection of inverted textual token embeddings encoded from per-stage prompts, where each prompt corresponds to a specific generation stage (i.e., a group of consecutive steps) of the diffusion model. Experimental results demonstrate that P* and ProSpect offer better disentanglement and controllability than existing methods. We apply ProSpect to various personalized, attribute-aware image generation applications, such as image-guided or text-driven manipulation of material, style, and layout, achieving previously unattainable results from a single input image without fine-tuning the diffusion model. Our source code is available at https://github.com/zyxElsa/ProSpect.
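To make the notion of per-stage prompts concrete, the following minimal sketch shows one way a denoising step could be mapped to a generation stage, and hence to the prompt that conditions that step. It is an illustration under stated assumptions and not the authors' implementation: the function names `stage_for_step` and `build_schedule` are hypothetical, the stages are assumed to be equal-sized groups of consecutive steps, and plain strings stand in for the learned token embeddings.

```python
from typing import List, Sequence


def stage_for_step(step: int, num_steps: int, num_stages: int) -> int:
    """Map a denoising step to its generation stage.

    Assumption: step 0 is the first (noisiest) step and the stages are
    equal-sized groups of consecutive steps. The abstract only says a
    stage is "a group of consecutive steps"; the equal split is ours.
    """
    if num_stages <= 0 or num_steps < num_stages:
        raise ValueError("need 1 <= num_stages <= num_steps")
    if not 0 <= step < num_steps:
        raise ValueError("step out of range")
    return step * num_stages // num_steps


def build_schedule(num_steps: int, stage_prompts: Sequence[str]) -> List[str]:
    """Return the conditioning prompt used at every denoising step.

    `stage_prompts` holds one placeholder per stage; in the method
    described above these would be inverted token embeddings.
    """
    num_stages = len(stage_prompts)
    return [
        stage_prompts[stage_for_step(s, num_steps, num_stages)]
        for s in range(num_steps)
    ]


if __name__ == "__main__":
    # Early stages would carry low-frequency content (e.g. layout),
    # later stages high-frequency content (e.g. fine detail).
    print(build_schedule(4, ["layout", "detail"]))
```

Under this view, editing a single attribute amounts to replacing the entry of `stage_prompts` for the relevant stages while leaving the others unchanged.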