Large text-to-image models have achieved a remarkable leap in the evolution of AI, enabling high-quality and diverse synthesis of images from a given text prompt. However, these models lack the ability to mimic the appearance of subjects in a given reference set and to synthesize novel renditions of them in different contexts. In this work, we present a new approach for "personalization" of text-to-image diffusion models. Given as input just a few images of a subject, we fine-tune a pretrained text-to-image model so that it learns to bind a unique identifier to that specific subject. Once the subject is embedded in the output domain of the model, the unique identifier can be used to synthesize novel photorealistic images of the subject contextualized in different scenes. By leveraging the semantic prior embedded in the model together with a new autogenous class-specific prior preservation loss, our technique enables synthesizing the subject in diverse scenes, poses, views, and lighting conditions that do not appear in the reference images. We apply our technique to several previously unassailable tasks, including subject recontextualization, text-guided view synthesis, and artistic rendering, all while preserving the subject's key features. We also provide a new dataset and evaluation protocol for this new task of subject-driven generation. Project page: https://dreambooth.github.io/
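As a rough illustration of how a class-specific prior preservation term might enter the fine-tuning objective, the sketch below adds a weighted second term, computed on class images generated by the pretrained model itself, to the usual reconstruction term on the subject images. The abstract does not give the formula, so everything here is an assumption: the function names, the use of plain mean squared error, the `prior_weight` parameter and its default, and the `[V]` placeholder for the unique identifier are all hypothetical.

```python
# Illustrative sketch only: names, the MSE form, and the weight value are
# assumptions for exposition, not details taken from the abstract.

def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def personalization_loss(subject_pred, subject_target,
                         prior_pred, prior_target,
                         prior_weight=1.0):
    """Subject reconstruction term plus a weighted prior preservation term.

    subject_*: model output and target for a reference image whose prompt
               contains the unique identifier (hypothetically "a [V] dog").
    prior_*:   model output and target for a class image that the pretrained
               model generated itself (hypothetically from "a dog").
    """
    subject_term = mse(subject_pred, subject_target)
    prior_term = mse(prior_pred, prior_target)
    return subject_term + prior_weight * prior_term


if __name__ == "__main__":
    # Toy vectors standing in for flattened model outputs and targets.
    loss = personalization_loss([1.0, 2.0], [1.0, 0.0],
                                [0.0, 0.0], [1.0, 1.0],
                                prior_weight=0.5)
    print(loss)
```

The second term is what keeps the fine-tuned model close to its original behavior on the broader class, so that binding the identifier to one subject does not erase the ability to generate other members of that class in varied scenes and poses.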