Diffusion models have shown superior performance in image generation and manipulation, but their inherent stochasticity makes it difficult to preserve and manipulate image content and identity. While previous approaches such as DreamBooth and Textual Inversion have proposed model or latent representation personalization to maintain the content, their reliance on multiple reference images and complex training limits their practicality. In this paper, we present a simple yet highly effective approach to personalization using a highly personalized (HiPer) text embedding, obtained by decomposing the CLIP embedding space for personalization and content manipulation. Our method requires neither model fine-tuning nor identifiers, yet still enables manipulation of background, texture, and motion with just a single image and a target text. Through experiments on diverse target texts, we demonstrate that our approach produces highly personalized and semantically complex image edits across a wide range of tasks. We believe that the novel understanding of the text embedding space presented in this work has the potential to inspire further research across various tasks.