Recent text-to-image generation models like DreamBooth have made remarkable progress in generating highly customized images of a target subject by fine-tuning an "expert model" for that subject from a few examples. However, this process is expensive, since a new expert model must be learned for each subject. In this paper, we present SuTI, a Subject-driven Text-to-Image generator that replaces subject-specific fine-tuning with in-context learning. Given a few demonstrations of a new subject, SuTI can instantly generate novel renditions of the subject in different scenes, without any subject-specific optimization. SuTI is powered by apprenticeship learning, in which a single apprentice model is learned from data generated by a massive number of subject-specific expert models. Specifically, we mine millions of image clusters from the Internet, each centered around a specific visual subject. We use these clusters to train a massive number of expert models, each specializing in a different subject. The apprentice model SuTI then learns to imitate the behavior of these fine-tuned experts. SuTI can generate high-quality, customized subject-specific images 20x faster than optimization-based SoTA methods. On the challenging DreamBench and DreamBench-v2 benchmarks, our human evaluation shows that SuTI significantly outperforms existing models such as InstructPix2Pix, Textual Inversion, Imagic, Prompt2Prompt, Re-Imagen and DreamBooth, especially in subject alignment and text alignment.