While generative models produce high-quality images of concepts learned from a large-scale database, users often wish to synthesize instantiations of their own concepts (for example, their family, pets, or items). Can we teach a model to quickly acquire a new concept, given a few examples? Furthermore, can we compose multiple new concepts together? We propose Custom Diffusion, an efficient method for augmenting existing text-to-image models. We find that optimizing only a few parameters in the text-to-image conditioning mechanism is sufficiently powerful to represent new concepts while enabling fast tuning (~6 minutes). Additionally, we can jointly train for multiple concepts or combine multiple fine-tuned models into one via closed-form constrained optimization. Our fine-tuned model generates variations of multiple new concepts and seamlessly composes them with existing concepts in novel settings. Our method outperforms or performs on par with several baselines and concurrent works in both qualitative and quantitative evaluations while being memory- and compute-efficient.
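The idea of optimizing only a few parameters in the conditioning mechanism can be illustrated with a small, dependency-free sketch. It assumes, beyond what the abstract states, that conditioning is done through cross-attention and that the tuned subset is the key and value projections of those layers. The parameter names, layer sizes, and the helpers `select_trainable` and `trainable_fraction` are hypothetical and chosen for illustration only; the toy sizes are not representative of a real model.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical parameter inventory: name -> number of scalar weights.
# Names and sizes are illustrative, not taken from any real checkpoint.
PARAMS: Dict[str, int] = {
    "unet.block0.self_attn.to_q.weight": 320 * 320,
    "unet.block0.self_attn.to_k.weight": 320 * 320,
    "unet.block0.self_attn.to_v.weight": 320 * 320,
    "unet.block0.cross_attn.to_q.weight": 320 * 320,
    "unet.block0.cross_attn.to_k.weight": 320 * 768,
    "unet.block0.cross_attn.to_v.weight": 320 * 768,
    "unet.block0.ff.weight": 320 * 1280,
}

# Assumption: only the text-conditioned key/value projections are tuned.
DEFAULT_MARKERS: Tuple[str, ...] = ("cross_attn.to_k", "cross_attn.to_v")


def select_trainable(
    names: Iterable[str], markers: Tuple[str, ...] = DEFAULT_MARKERS
) -> List[str]:
    """Return the parameter names that would be updated; all others stay frozen."""
    return [n for n in names if any(m in n for m in markers)]


def trainable_fraction(params: Dict[str, int], trainable: List[str]) -> float:
    """Fraction of all scalar weights that belong to the trainable subset."""
    total = sum(params.values())
    return sum(params[n] for n in trainable) / total


trainable = select_trainable(PARAMS)
fraction = trainable_fraction(PARAMS, trainable)
print(f"trainable tensors: {trainable}")
print(f"trainable fraction of this toy inventory: {fraction:.3f}")
```

In a real training setup, the same name-based filter would typically decide which tensors receive gradients while the rest of the pretrained model stays frozen, which is what keeps tuning fast and memory use low.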