We propose CatVersion, an inversion-based method that learns a personalized concept from a handful of examples. Users can then write text prompts to generate images that embody the concept, achieving text-to-image personalization. Existing approaches emphasize either word embedding learning or parameter fine-tuning of the diffusion model, which can cause concept dilution or overfitting. In contrast, our method concatenates embeddings in the feature-dense space of the diffusion model's text encoder to learn the gap between the personalized concept and its base class, aiming to preserve as much of the diffusion model's prior knowledge as possible while restoring the personalized concept. To this end, we first dissect how the text encoder is integrated into the image generation process in order to identify its feature-dense space. We then concatenate embeddings to the Keys and Values in this space, so that the concatenated embeddings ultimately manifest as a residual on the original attention output. To quantify the results of personalized image generation more accurately and with less bias, we improve the CLIP image alignment score using masks. Qualitatively and quantitatively, CatVersion restores personalized concepts more faithfully and enables more robust editing.
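To illustrate why concatenating embeddings to the Keys and Values shows up as a residual on the attention output, the following is a minimal sketch in plain Python. It is a toy single-head, single-query attention and not the CatVersion implementation: the function names, the vector sizes, and all numeric values are assumptions made for illustration, and the sketch does not show which text-encoder layers are used or how the embeddings are trained.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Single-query scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights

def concat_attention(query, keys, values, extra_keys, extra_values):
    """Attention after appending extra (hypothetical, learnable) keys and values."""
    out, weights = attention(query, keys + extra_keys, values + extra_values)
    # Total attention mass that falls on the appended embeddings.
    gate = sum(weights[len(keys):])
    return out, gate

# Toy inputs (illustrative values only).
query = [0.2, -0.1]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
extra_keys = [[0.3, -0.2]]      # stand-in for a learned key embedding
extra_values = [[0.9, 0.1]]     # stand-in for a learned value embedding

base_out, _ = attention(query, keys, values)
extra_out, _ = attention(query, extra_keys, extra_values)
cat_out, gate = concat_attention(query, keys, values, extra_keys, extra_values)

# The concatenated output equals the original output plus a gated residual.
residual = [gate * (e - b) for e, b in zip(extra_out, base_out)]
reconstructed = [b + r for b, r in zip(base_out, residual)]
```

In this sketch, `cat_out` matches `reconstructed` up to floating-point error, so the appended embeddings change the output only through the term `gate * (extra_out - base_out)`, while the original Keys and Values stay untouched. This is one way to read the residual behavior described above, under the stated assumptions.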