Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models

Text-to-image personalization aims to teach a pre-trained diffusion model to reason about novel, user-provided concepts, embedding them into new scenes guided by natural language prompts. However, current personalization approaches struggle with lengthy training times, high storage requirements, or loss of identity. To overcome these limitations, we propose an encoder-based domain-tuning approach. Our key insight is that by underfitting on a large set of concepts from a given domain, we can improve generalization and create a model that is more amenable to quickly adding novel concepts from the same domain. Specifically, we employ two components. First, an encoder that takes as input a single image of a target concept from a given domain, e.g. a specific face, and learns to map it into a word embedding representing the concept. Second, a set of regularized weight offsets for the text-to-image model that learn how to effectively ingest additional concepts. Together, these components are used to guide the learning of unseen concepts, allowing us to personalize a model using only a single image and as few as 5 training steps, accelerating personalization from dozens of minutes to seconds while preserving quality.
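The interplay of the two components can be illustrated with a deliberately tiny sketch. It is not the paper's architecture: the diffusion model is replaced by a 2x2 linear map, the encoder by a fixed linear projection, and the regularizer is assumed to be a plain L2 penalty on the offsets (the abstract only says "regularized"). All names (`encode`, `tune_step`, `W_enc`, `dW`, `lam`) and all numbers are hypothetical. The sketch only shows the last stage described above: start from the encoder's predicted embedding, then take 5 short tuning steps on the embedding and the weight offsets.

```python
# Toy stand-in for encoder-guided personalization. Everything here is an
# illustrative assumption; none of it is taken from the paper's implementation.

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def encode(W_enc, image_feats):
    """Hypothetical encoder: linear map from image features to a word embedding."""
    return matvec(W_enc, image_feats)

def effective_weights(W_base, dW):
    """Frozen base weights plus learned offsets."""
    return [[b + d for b, d in zip(rb, rd)] for rb, rd in zip(W_base, dW)]

def loss(W_base, dW, emb, target, lam):
    """Squared reconstruction error plus an assumed L2 penalty on the offsets."""
    pred = matvec(effective_weights(W_base, dW), emb)
    recon = sum((p - t) ** 2 for p, t in zip(pred, target))
    reg = lam * sum(d * d for row in dW for d in row)
    return recon + reg

def tune_step(W_base, dW, emb, target, lam, lr):
    """One gradient-descent step on both the offsets and the embedding."""
    W = effective_weights(W_base, dW)
    err = [p - t for p, t in zip(matvec(W, emb), target)]
    # d loss / d dW[i][j] = 2 * err[i] * emb[j] + 2 * lam * dW[i][j]
    new_dW = [[d - lr * (2 * err[i] * emb[j] + 2 * lam * d)
               for j, d in enumerate(row)]
              for i, row in enumerate(dW)]
    # d loss / d emb[j] = 2 * sum_i err[i] * W[i][j]
    new_emb = [e - lr * 2 * sum(err[i] * W[i][j] for i in range(len(W)))
               for j, e in enumerate(emb)]
    return new_dW, new_emb

# Made-up toy values.
W_base = [[1.0, 0.0], [0.0, 1.0]]            # frozen "model" weights
W_enc = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]   # "pre-trained" encoder
image_feats = [1.0, 2.0, 3.0]                # features of the single input image
target = [1.0, 1.0]                          # what the model should reproduce

emb = encode(W_enc, image_feats)             # encoder's initial guess
dW = [[0.0, 0.0], [0.0, 0.0]]                # offsets start at zero
initial_loss = loss(W_base, dW, emb, target, lam=0.1)

for _ in range(5):                           # "as few as 5 training steps"
    dW, emb = tune_step(W_base, dW, emb, target, lam=0.1, lr=0.1)

final_loss = loss(W_base, dW, emb, target, lam=0.1)
print(initial_loss, final_loss)
```

Because the encoder already places the embedding close to a good solution, a handful of steps is enough to reduce the loss in this toy setting, and the penalty term keeps the offsets small so the base weights stay close to their original values.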

