We introduce an Extended Textual Conditioning space for text-to-image models, referred to as $P+$. This space consists of multiple textual conditions, derived from per-layer prompts, each corresponding to a layer of the denoising U-Net of the diffusion model. We show that the extended space provides greater disentanglement and control over image synthesis. We further introduce Extended Textual Inversion (XTI), in which images are inverted into $P+$ and represented by per-layer tokens. We show that XTI is more expressive and precise, and converges faster, than inversion in the original Textual Inversion (TI) space. The extended inversion method does not involve any noticeable trade-off between reconstruction and editability, and it induces more regular inversions. We conduct extensive experiments to analyze and understand the properties of the new space, and to showcase the effectiveness of our method for personalizing text-to-image models. Furthermore, we use the unique properties of this space to achieve previously unattainable results in object-style mixing with text-to-image models. Project page: https://prompt-plus.github.io
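The difference between a single shared textual condition and per-layer conditions can be illustrated with a toy example. The following is a minimal sketch, not the method's actual implementation: it assumes a stack of simplified single-head cross-attention layers in place of a real denoising U-Net, and the names `cross_attention`, `denoise_step`, `p`, and `p_plus` are hypothetical, chosen only for illustration.

```python
import numpy as np


def cross_attention(x, cond):
    """Toy single-head cross-attention (an assumption, not a real U-Net layer).

    x:    image features, shape (n, d)
    cond: text condition tokens, shape (m, d)
    """
    scores = x @ cond.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over text tokens
    return x + weights @ cond  # residual update


def denoise_step(x, conds):
    """Pass x through a stack of toy layers; layer i is conditioned on conds[i]."""
    for cond in conds:
        x = cross_attention(x, cond)
    return x


rng = np.random.default_rng(0)
num_layers, num_tokens, dim = 4, 3, 8
x = rng.normal(size=(5, dim))

# Single shared condition: every layer receives the same text embedding.
p = rng.normal(size=(num_tokens, dim))
out_p = denoise_step(x, [p] * num_layers)

# Extended space: each layer receives its own text embedding.
p_plus = [rng.normal(size=(num_tokens, dim)) for _ in range(num_layers)]
out_p_plus = denoise_step(x, p_plus)
```

In this sketch the shared-condition case is simply the special case in which all per-layer conditions are identical. Under the same assumptions, an inversion into the extended space would optimize one embedding per layer rather than a single shared embedding.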