Recent text-to-image generation models like DreamBooth have made remarkable progress in generating highly customized images of a target subject by fine-tuning an ``expert model'' for that subject from a few examples. However, this process is expensive, since a new expert model must be learned for each subject. In this paper, we present SuTI, a Subject-driven Text-to-Image generator that replaces subject-specific fine-tuning with \emph{in-context} learning. Given a few demonstrations of a new subject, SuTI can instantly generate novel renditions of the subject in different scenes, without any subject-specific optimization. SuTI is powered by {\em apprenticeship learning}, where a single apprentice model is learned from data generated by a massive number of subject-specific expert models. Specifically, we mine millions of image clusters from the Internet, each centered around a specific visual subject. We use these clusters to train a massive number of expert models, each specialized in a different subject. The apprentice model SuTI then learns to mimic the behavior of these experts through the proposed apprenticeship learning algorithm. SuTI can generate high-quality, customized subject-specific images 20x faster than optimization-based SoTA methods. On the challenging DreamBench and DreamBench-v2 benchmarks, our human evaluation shows that SuTI significantly outperforms existing approaches such as InstructPix2Pix, Textual Inversion, Imagic, Prompt2Prompt, and Re-Imagen, while performing on par with DreamBooth.