Recent text-to-image generation methods have made remarkable progress in synthesizing realistic human photos conditioned on text prompts. However, existing personalized generation methods cannot simultaneously achieve high efficiency, high identity (ID) fidelity, and flexible text controllability. In this work, we introduce PhotoMaker, an efficient personalized text-to-image generation method that encodes an arbitrary number of input ID images into a stacked ID embedding to preserve ID information. Serving as a unified ID representation, this embedding can not only comprehensively encapsulate the characteristics of a single input ID, but also accommodate the characteristics of different IDs for subsequent integration, which paves the way for more intriguing and practically valuable applications. In addition, to drive the training of PhotoMaker, we propose an ID-oriented data construction pipeline for assembling the training data. Trained on the dataset constructed with this pipeline, PhotoMaker preserves identity better than methods based on test-time fine-tuning, while also offering significant speed improvements, high-quality generation results, strong generalization, and a wide range of applications. Our project page is available at https://photo-maker.github.io/
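
The abstract describes encoding an arbitrary number of ID images into a single stacked embedding, and combining the characteristics of different IDs. The following is only a minimal sketch of that idea and not the PhotoMaker implementation: `encode_image`, `stack_id_embedding`, and `mix_identities` are hypothetical names, the "encoder" is a toy averaging function standing in for a learned image encoder, and images are assumed to be flat lists of floats.

```python
from typing import List

Vector = List[float]


def encode_image(pixels: List[float], dim: int = 4) -> Vector:
    """Hypothetical stand-in for an image encoder (assumption, not the real model).

    Folds the pixel values into `dim` buckets and averages each bucket.
    """
    sums = [0.0] * dim
    counts = [0] * dim
    for i, value in enumerate(pixels):
        sums[i % dim] += value
        counts[i % dim] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def stack_id_embedding(images: List[List[float]], dim: int = 4) -> List[Vector]:
    """Encode each ID image and stack the results along a length axis.

    The result has shape (N, dim), where N is the number of input images,
    so any number of images of the same ID yields one unified representation.
    """
    if not images:
        raise ValueError("at least one ID image is required")
    return [encode_image(img, dim) for img in images]


def mix_identities(*stacks: List[Vector]) -> List[Vector]:
    """Concatenate the stacked embeddings of different IDs along the length axis."""
    mixed: List[Vector] = []
    for stack in stacks:
        mixed.extend(stack)
    return mixed


# Three images of one ID and one image of another ID (toy data).
id_a = stack_id_embedding([[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [0.0] * 8])
id_b = stack_id_embedding([[1.0] * 4])
mixed = mix_identities(id_a, id_b)
print(len(id_a), len(id_b), len(mixed))
```

Because the stack grows along the length axis only, the per-image dimension stays fixed regardless of how many images are supplied, which is what lets the same representation hold one ID or several in this sketch.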