We introduce an Extended Textual Conditioning space for text-to-image models, referred to as $P+$. This space consists of multiple textual conditions, derived from per-layer prompts, each corresponding to one layer of the denoising U-Net of the diffusion model. We show that the extended space provides greater disentanglement and control over image synthesis. We further introduce Extended Textual Inversion (XTI), in which images are inverted into $P+$ and represented by per-layer tokens. We show that XTI is more expressive and precise, and converges faster, than the original Textual Inversion (TI). The extended inversion method does not involve any noticeable trade-off between reconstruction and editability, and it induces more regular inversions. We conduct extensive experiments to analyze the properties of the new space and to demonstrate the effectiveness of our method for personalizing text-to-image models. Furthermore, we use the unique properties of this space to achieve previously unattainable results in object-style mixing with text-to-image models. Project page: https://prompt-plus.github.io
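The difference between a single shared condition and the per-layer conditions of $P+$ can be illustrated with a toy sketch. This is not the authors' implementation: the layer names, the function names, and the use of plain prompt strings in place of text-encoder embeddings are all assumptions made for illustration, and the choice of which layer receives which prompt below is arbitrary.

```python
from typing import Dict, List

# Hypothetical layer names (assumption); a real denoising U-Net has more
# conditioned layers, and they receive embeddings rather than raw strings.
LAYERS: List[str] = ["down_0", "down_1", "mid", "up_0", "up_1"]


def conditioning_p(prompt: str) -> Dict[str, str]:
    """Standard conditioning: every layer receives the same prompt."""
    return {layer: prompt for layer in LAYERS}


def conditioning_p_plus(per_layer: Dict[str, str], default: str) -> Dict[str, str]:
    """Extended conditioning: each layer may receive its own prompt.

    Layers missing from `per_layer` fall back to `default`.
    """
    unknown = set(per_layer) - set(LAYERS)
    if unknown:
        raise KeyError(f"unknown layers: {sorted(unknown)}")
    return {layer: per_layer.get(layer, default) for layer in LAYERS}


def xti_placeholders(name: str) -> Dict[str, str]:
    """XTI-style placeholders: one learnable token per layer
    instead of a single token shared by all layers."""
    return {layer: f"<{name}_{i}>" for i, layer in enumerate(LAYERS)}


# Mixing two prompts across layers (the split chosen here is arbitrary).
mixed = conditioning_p_plus({"mid": "a photo of a cat"}, default="a watercolor painting")
tokens = xti_placeholders("x")
```

In this sketch, `conditioning_p` corresponds to the usual single-prompt setup, while `conditioning_p_plus` and `xti_placeholders` mirror the per-layer prompts and per-layer tokens described in the abstract.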