Text-to-image (T2I) personalization allows users to guide the creative image generation process by combining their own visual concepts in natural language prompts. Recently, encoder-based techniques have emerged as an effective new approach for T2I personalization, reducing the need for multiple images and long training times. However, most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts. In this work, we propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts. We introduce a novel contrastive-based regularization technique that maintains high fidelity to the target concept's characteristics while keeping the predicted embeddings close to editable regions of the latent space, by pushing the predicted tokens toward their nearest existing CLIP tokens. Our experimental results demonstrate the effectiveness of our approach and show that the learned tokens are more semantically meaningful than tokens predicted by unregularized models. This leads to a better representation that achieves state-of-the-art performance while being more flexible than previous methods.
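To make the regularization idea concrete, the following is a minimal sketch of an InfoNCE-style loss that pulls a predicted token embedding toward its nearest existing token embeddings and away from all others. It is an illustration under stated assumptions, not the formulation used in this work: the function name `nearest_token_contrastive_loss`, the cosine-similarity metric, the parameters `k` and `temperature`, and the toy identity-matrix vocabulary (a stand-in for the CLIP token embedding table) are all hypothetical.

```python
import numpy as np


def nearest_token_contrastive_loss(pred, vocab, k=1, temperature=0.1):
    """InfoNCE-style loss pulling `pred` toward its k nearest vocabulary tokens.

    pred:  (d,) predicted token embedding (hypothetical encoder output)
    vocab: (n, d) existing token embeddings (toy stand-in for a CLIP table)
    """
    # Cosine similarity between the prediction and every vocabulary token.
    pred_n = pred / np.linalg.norm(pred)
    vocab_n = vocab / np.linalg.norm(vocab, axis=1, keepdims=True)
    sims = vocab_n @ pred_n  # shape (n,)

    # The k most similar tokens act as positives; the rest are negatives.
    nearest = np.argsort(-sims)[:k]

    # Numerically stable log-softmax over all vocabulary tokens.
    logits = sims / temperature
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())

    # Mean negative log-probability assigned to the positives.
    loss = -log_probs[nearest].mean()
    return loss, nearest


# Toy vocabulary: four orthogonal "tokens" (an assumption for illustration).
vocab = np.eye(4)

# A prediction lying exactly on token 0 versus one halfway between tokens 0 and 1.
loss_aligned, nearest_idx = nearest_token_contrastive_loss(
    np.array([1.0, 0.0, 0.0, 0.0]), vocab
)
loss_between, _ = nearest_token_contrastive_loss(
    np.array([0.5, 0.5, 0.0, 0.0]), vocab
)
```

In this toy setting the prediction that sits on an existing token incurs a smaller loss than the one lying between two tokens, which is the behavior such a regularizer is meant to encourage. In practice a term like this would be added to the personalization objective with a weighting coefficient, so that fidelity to the target concept is balanced against closeness to existing tokens.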