We introduce an Extended Textual Conditioning space for text-to-image models, referred to as $P+$. This space consists of multiple textual conditions, derived from per-layer prompts, each corresponding to a layer of the denoising U-Net of the diffusion model. We show that the extended space provides greater disentanglement and control over image synthesis. We further introduce Extended Textual Inversion (XTI), in which images are inverted into $P+$ and represented by per-layer tokens. We show that XTI is more expressive and precise, and converges faster, than the original Textual Inversion (TI). The extended inversion method does not involve any noticeable trade-off between reconstruction and editability, and it induces more regular inversions. We conduct extensive experiments to analyze and understand the properties of the new space, and to showcase the effectiveness of our method for personalizing text-to-image models. Furthermore, we use the unique properties of this space to achieve previously unattainable results in object-style mixing with text-to-image models. Project page: https://prompt-plus.github.io
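To make the distinction between standard conditioning and $P+$ concrete, the following is a minimal sketch and not the implementation used in this work. It assumes a toy stand-in for the text encoder (`toy_encode`) and a toy layer count (`NUM_LAYERS = 4`); the helper names `conditions_p` and `conditions_p_plus` are hypothetical and introduced only for illustration. With standard conditioning, one prompt embedding is shared by every conditioned layer of the denoising U-Net, whereas in $P+$ each layer receives the embedding of its own prompt.

```python
from typing import List, Sequence

Embedding = List[float]


def toy_encode(prompt: str, dim: int = 4) -> Embedding:
    """Hypothetical stand-in for a text encoder: a deterministic toy vector."""
    codes = [ord(c) for c in prompt] or [0]
    return [sum(codes[i::dim]) / 1000.0 for i in range(dim)]


def conditions_p(prompt: str, num_layers: int) -> List[Embedding]:
    """Standard conditioning: the same prompt embedding feeds every layer."""
    shared = toy_encode(prompt)
    return [shared for _ in range(num_layers)]


def conditions_p_plus(prompts: Sequence[str]) -> List[Embedding]:
    """Extended conditioning (P+): one prompt, hence one embedding, per layer."""
    return [toy_encode(p) for p in prompts]


NUM_LAYERS = 4  # assumed toy value, not the layer count of any real U-Net

# Standard conditioning: a single prompt for all layers.
p = conditions_p("a red car", NUM_LAYERS)

# P+: per-layer prompts; here the first two layers see one prompt
# and the last two layers see another.
p_plus = conditions_p_plus(
    ["a red car", "a red car", "a blue cube", "a blue cube"]
)
```

In this sketch, standard conditioning is the special case of $P+$ in which all per-layer prompts coincide. Under the same assumptions, XTI would correspond to optimizing one learned token embedding per layer rather than a single shared one.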