Personalized text-to-image generation models enable users to create images that depict their individual possessions in diverse scenes, finding applications in various domains. To achieve this personalization capability, existing methods rely on finetuning a text-to-image foundation model on a user's custom dataset, which can be non-trivial for general users, resource-intensive, and time-consuming. Despite attempts to develop finetuning-free methods, their generation quality is much lower than that of their finetuning-based counterparts. In this paper, we propose Joint-Image Diffusion (\jedi), an effective technique for learning a finetuning-free personalization model. Our key idea is to learn the joint distribution of multiple related text-image pairs that share a common subject. To facilitate learning, we propose a scalable synthetic dataset generation technique. Once trained, our model enables fast and easy personalization at test time simply by using reference images as input during the sampling process. Our approach does not require any expensive optimization process or additional modules and can faithfully preserve the identity represented by any number of reference images. Experimental results show that our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both prior finetuning-based and finetuning-free personalization baselines.
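To make the test-time idea concrete, the following is a minimal, hypothetical sketch of reference-conditioned sampling with a joint-image diffusion model: the reference images are clamped to their noised clean latents at every denoising step, and only the target slot of the joint set is actually generated. The \texttt{JointDenoiser} placeholder, its signature, and the simplified DDIM-style update are illustrative assumptions for exposition, not the released implementation.

\begin{verbatim}
# Hypothetical sketch: personalization as conditional sampling from a
# joint-image diffusion model. References are re-noised to the current
# timestep and concatenated with the evolving target latent.
import torch

class JointDenoiser(torch.nn.Module):
    """Placeholder for a denoiser that jointly processes N related images."""
    def __init__(self, channels=4):
        super().__init__()
        self.net = torch.nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x_t, t, text_emb):
        # x_t: (N, C, H, W) joint set of noisy latents; predicts the noise.
        return self.net(x_t)

@torch.no_grad()
def personalized_sample(denoiser, ref_latents, text_emb, steps=50):
    """ref_latents: (R, C, H, W) clean latents of the reference images."""
    betas = torch.linspace(1e-4, 2e-2, steps)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    target = torch.randn_like(ref_latents[:1])  # the image to generate
    for i in reversed(range(steps)):
        a_bar = alphas_bar[i]
        # Noise the known references to the current step; keep the target free.
        noisy_refs = (a_bar.sqrt() * ref_latents
                      + (1 - a_bar).sqrt() * torch.randn_like(ref_latents))
        x_t = torch.cat([noisy_refs, target], dim=0)  # joint set (R+1, C, H, W)

        eps = denoiser(x_t, i, text_emb)[-1:]  # noise prediction for target slot
        # Simplified deterministic (DDIM-style) update from the x0 estimate.
        x0 = (target - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
        a_bar_prev = alphas_bar[i - 1] if i > 0 else torch.tensor(1.0)
        target = a_bar_prev.sqrt() * x0 + (1 - a_bar_prev).sqrt() * eps
    return target

# Usage: three reference views of the same subject guide one new generation.
denoiser = JointDenoiser()
refs = torch.randn(3, 4, 64, 64)
sample = personalized_sample(denoiser, refs, text_emb=None)
print(sample.shape)  # torch.Size([1, 4, 64, 64])
\end{verbatim}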