A Neural Space-Time Representation for Text-to-Image Personalization

A key aspect of text-to-image personalization methods is the manner in which the target concept is represented within the generative process. This choice greatly affects the visual fidelity, downstream editability, and disk space needed to store the learned concept. In this paper, we explore a new text-conditioning space that is dependent on both the denoising process timestep (time) and the denoising U-Net layers (space) and showcase its compelling properties. A single concept in the space-time representation is composed of hundreds of vectors, one for each combination of time and space, making this space challenging to optimize directly. Instead, we propose to implicitly represent a concept in this space by optimizing a small neural mapper that receives the current time and space parameters and outputs the matching token embedding. In doing so, the entire personalized concept is represented by the parameters of the learned mapper, resulting in a compact, yet expressive, representation. Similarly to other personalization methods, the output of our neural mapper resides in the input space of the text encoder. We observe that one can significantly improve the convergence and visual fidelity of the concept by introducing a textual bypass, where our neural mapper additionally outputs a residual that is added to the output of the text encoder. Finally, we show how one can impose an importance-based ordering over our implicit representation, providing users control over the reconstruction and editability of the learned concept using a single trained model. We demonstrate the effectiveness of our approach over a range of concepts and prompts, showing our method's ability to generate high-quality and controllable compositions without fine-tuning any parameters of the generative model itself.
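To make the idea of an implicit space-time representation concrete, the following is a minimal sketch of a mapper that takes a denoising timestep and a U-Net layer index and returns two vectors: a token embedding intended for the input space of the text encoder, and a residual intended for the textual bypass. It is an illustration only, not the authors' implementation. The class name `TinySpaceTimeMapper`, all sizes, the random initialization, the shared dimension for both outputs, and the `keep` argument (one hypothetical way to expose an importance ordering by truncating hidden units) are assumptions that do not come from the text above. Training is omitted.

```python
import math
import random


class TinySpaceTimeMapper:
    """Toy MLP: (timestep, layer index) -> (token embedding, bypass residual).

    All names and sizes are hypothetical; weights are random, not trained.
    """

    def __init__(self, num_timesteps, num_layers, hidden_dim, embed_dim, seed=0):
        rng = random.Random(seed)
        self.num_timesteps = num_timesteps
        self.num_layers = num_layers
        self.hidden_dim = hidden_dim
        self.embed_dim = embed_dim
        scale = 1.0 / math.sqrt(hidden_dim)
        # Input layer: two scalars (normalized time, normalized layer) -> hidden.
        self.w_in = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(hidden_dim)]
        self.b_in = [0.0] * hidden_dim
        # Two linear heads that share the hidden features.
        self.w_embed = [[rng.gauss(0, scale) for _ in range(hidden_dim)]
                        for _ in range(embed_dim)]
        self.w_bypass = [[rng.gauss(0, scale) for _ in range(hidden_dim)]
                         for _ in range(embed_dim)]

    def hidden(self, t, layer, keep=None):
        # Normalize both conditioning inputs to roughly [0, 1].
        x0 = t / self.num_timesteps
        x1 = layer / self.num_layers
        h = [max(0.0, w[0] * x0 + w[1] * x1 + b)  # ReLU
             for w, b in zip(self.w_in, self.b_in)]
        if keep is not None:
            # Assumption: keep only the first `keep` hidden units, zero the rest.
            h = [v if i < keep else 0.0 for i, v in enumerate(h)]
        return h

    def __call__(self, t, layer, keep=None):
        h = self.hidden(t, layer, keep)
        embed = [sum(w * v for w, v in zip(row, h)) for row in self.w_embed]
        bypass = [sum(w * v for w, v in zip(row, h)) for row in self.w_bypass]
        return embed, bypass

    def num_parameters(self):
        # Input weights + input biases + two output heads.
        return 2 * self.hidden_dim + self.hidden_dim + 2 * self.embed_dim * self.hidden_dim


# Usage: query the mapper for one (time, space) combination.
mapper = TinySpaceTimeMapper(num_timesteps=1000, num_layers=16, hidden_dim=8, embed_dim=4)
embed, bypass = mapper(t=500, layer=3)
# `embed` would replace the placeholder token's embedding before the text encoder;
# `bypass` would be added to the text encoder's output for that token.
```

In this sketch the whole concept is carried by the mapper's weights, so its storage cost is `mapper.num_parameters()` values no matter how many (timestep, layer) combinations are queried, whereas storing one vector per combination grows with the size of that grid.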
