Recent advances in personalized image generation allow a pre-trained text-to-image model to learn a new concept from a set of images. However, existing personalization approaches usually require heavy test-time finetuning for each concept, which is time-consuming and difficult to scale. We propose InstantBooth, a novel approach built upon pre-trained text-to-image models that enables instant text-guided image personalization without any test-time finetuning. We achieve this with several major components. First, we learn the general concept of the input images by converting them to a textual token with a learnable image encoder. Second, to keep the fine details of the identity, we learn a rich visual feature representation by introducing a few adapter layers into the pre-trained model. We train our components only on text-image pairs, without using paired images of the same concept. Compared to test-time finetuning-based methods such as DreamBooth and Textual-Inversion, our model generates competitive results on unseen concepts in terms of language-image alignment, image fidelity, and identity preservation, while being 100 times faster.
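The two components named in the abstract can be illustrated with a toy sketch in plain Python. It is not the authors' implementation: the placeholder token `<v>`, the averaging of per-image embeddings, the attention form, and the `tanh` gate are all assumptions chosen for illustration, and every function name here is hypothetical. The sketch only shows the general shape of the idea: an image-derived embedding stands in for one token of the prompt, and an adapter adds visual detail to a hidden state through a gated residual that leaves the pre-trained model unchanged while the gate is zero.

```python
import math


def mean_vector(vectors):
    """Average equal-length vectors element-wise (assumed way to pool several images)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def inject_concept_token(prompt_tokens, token_embeddings, concept_embedding,
                         placeholder="<v>"):
    """Embed a prompt, using the concept embedding wherever the placeholder appears."""
    return [
        concept_embedding if token == placeholder else token_embeddings[token]
        for token in prompt_tokens
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_adapter(hidden, visual_features, gate):
    """Add attention over visual features to one hidden vector, scaled by tanh(gate).

    With gate == 0.0 the output equals the input, so the adapter starts as an
    identity mapping (an assumed design, not stated in the abstract).
    """
    scale = 1.0 / math.sqrt(len(hidden))
    scores = [
        sum(h * v for h, v in zip(hidden, feature)) * scale
        for feature in visual_features
    ]
    weights = softmax(scores)
    attended = [
        sum(w * feature[i] for w, feature in zip(weights, visual_features))
        for i in range(len(hidden))
    ]
    g = math.tanh(gate)
    return [h + g * a for h, a in zip(hidden, attended)]


# Stand-ins for the image encoder's outputs on two input images (made-up numbers).
image_embeddings = [[1.0, 0.0], [0.0, 1.0]]
concept = mean_vector(image_embeddings)

# Tiny made-up vocabulary; "<v>" is the hypothetical placeholder for the new concept.
vocab = {"a": [0.1, 0.2], "photo": [0.3, 0.4], "of": [0.5, 0.6]}
prompt = inject_concept_token(["a", "photo", "of", "<v>"], vocab, concept)

unchanged = gated_adapter([1.0, 2.0], image_embeddings, gate=0.0)
adapted = gated_adapter([1.0, 2.0], image_embeddings, gate=1.0)
```

In this sketch no per-concept optimization takes place: a new concept only requires a forward pass through the encoder and adapters, which is the property the abstract credits for the speedup over test-time finetuning.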