We propose CatVersion, an inversion-based method that learns a personalized concept from a handful of examples. Users can then use text prompts to generate images that embody the personalized concept, thereby achieving text-to-image personalization. Existing approaches emphasize learning word embeddings or fine-tuning the parameters of the diffusion model, which can cause concept dilution or overfitting. In contrast, our method concatenates embeddings in the feature-dense space of the diffusion model's text encoder to learn the gap between the personalized concept and its base class, aiming to preserve as much of the diffusion model's prior knowledge as possible while restoring the personalized concept. To this end, we first dissect how the text encoder is integrated into the image generation process to identify the encoder's feature-dense space. We then concatenate embeddings to the Keys and Values in this space. In this way, the concatenated embeddings ultimately manifest as a residual on the original attention output. To quantify the results of personalized image generation more accurately and with less bias, we improve the CLIP image alignment score using masks. Qualitatively and quantitatively, CatVersion restores personalized concepts more faithfully and enables more robust editing.
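The claim that embeddings concatenated to the Keys and Values show up as a residual on the original attention output can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes a single attention head with no causal mask and no learned projections, and the names `delta_k` and `delta_v`, the tensor sizes, and the random values are all hypothetical stand-ins for the learned embeddings.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Plain scaled dot-product attention for a single head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def concat_attention(q, k, v, delta_k, delta_v):
    """Attention after appending extra rows to the Keys and Values.

    delta_k and delta_v are hypothetical stand-ins for learned embeddings.
    """
    k_cat = np.concatenate([k, delta_k], axis=0)
    v_cat = np.concatenate([v, delta_v], axis=0)
    return attention(q, k_cat, v_cat)


# Toy sizes and random values, chosen only for illustration.
rng = np.random.default_rng(0)
n_tokens, n_extra, dim = 5, 2, 8
q = rng.normal(size=(n_tokens, dim))
k = rng.normal(size=(n_tokens, dim))
v = rng.normal(size=(n_tokens, dim))
delta_k = rng.normal(size=(n_extra, dim))
delta_v = rng.normal(size=(n_extra, dim))

base = attention(q, k, v)
personalized = concat_attention(q, k, v, delta_k, delta_v)
residual = personalized - base

# Closed form: lam is the attention mass each query puts on the extra slots,
# so personalized = (1 - lam) * base + lam * extra_only.
scores_all = np.concatenate([q @ k.T, q @ delta_k.T], axis=1) / np.sqrt(dim)
lam = softmax(scores_all)[:, n_tokens:].sum(axis=1, keepdims=True)
extra_only = attention(q, delta_k, delta_v)
closed_form = lam * (extra_only - base)
```

In this sketch the output shape is unchanged by the concatenation, and `residual` equals `lam * (extra_only - base)`: the original attention output plus a correction that is scaled by how much attention mass the extra Key slots attract.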