Personalizing generative models offers a way to guide image generation with user-provided references. Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models. However, representing and editing specific visual attributes such as material, style, and layout remains a challenge, and existing methods lack disentanglement and editability. To address this, we propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low- to high-frequency information, providing a new perspective on representing, generating, and editing images. We develop Prompt Spectrum Space P*, an expanded textual conditioning space, and a new image representation method called ProSpect. ProSpect represents an image as a collection of inverted textual token embeddings encoded from per-stage prompts, where each prompt corresponds to a specific generation stage (i.e., a group of consecutive steps) of the diffusion model. Experimental results demonstrate that P* and ProSpect offer stronger disentanglement and controllability than existing methods. We apply ProSpect to various personalized, attribute-aware image generation applications, such as image/text-guided material/style/layout transfer/editing, achieving previously unattainable results with a single input image and without fine-tuning the diffusion model.
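To make the idea of per-stage prompts concrete, the following is a minimal sketch, not the authors' implementation, of how a denoising step could be mapped to a generation stage and how a per-stage conditioning could be selected at every step. The function names (stage_of_step, conditioning_schedule, swap_stages), the use of equal-sized stages, and the example sizes are all assumptions made for illustration; the abstract only states that each prompt corresponds to a group of consecutive steps.

```python
from typing import List, Sequence, TypeVar

E = TypeVar("E")  # stands in for one per-stage token embedding


def stage_of_step(step: int, num_steps: int, num_stages: int) -> int:
    """Map a denoising step (0 = first, noisiest step) to a stage index.

    Assumption: stages are contiguous and as equal in size as possible.
    """
    if not 1 <= num_stages <= num_steps:
        raise ValueError("need 1 <= num_stages <= num_steps")
    if not 0 <= step < num_steps:
        raise ValueError("step out of range")
    return step * num_stages // num_steps


def conditioning_schedule(stage_embeddings: Sequence[E], num_steps: int) -> List[E]:
    """Return the conditioning used at each denoising step, one entry per step."""
    k = len(stage_embeddings)
    return [stage_embeddings[stage_of_step(t, num_steps, k)] for t in range(num_steps)]


def swap_stages(base: Sequence[E], other: Sequence[E], stages: Sequence[int]) -> List[E]:
    """Hypothetical editing helper: take the listed stages from `other`, the rest from `base`."""
    if len(base) != len(other):
        raise ValueError("both images need the same number of stages")
    chosen = set(stages)
    return [other[i] if i in chosen else base[i] for i in range(len(base))]


if __name__ == "__main__":
    content = ["c0", "c1", "c2", "c3"]    # placeholder embeddings of image A
    reference = ["r0", "r1", "r2", "r3"]  # placeholder embeddings of image B
    mixed = swap_stages(content, reference, stages=[2, 3])
    print(conditioning_schedule(mixed, num_steps=8))
```

In this sketch the strings are placeholders for learned token embeddings. Since diffusion models generate low-frequency information first, replacing only the later stages swaps the conditioning that governs higher-frequency detail while leaving the early-stage conditioning untouched; which stages correspond to which attribute is not specified here.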