Personalized image synthesis has progressed significantly with methods such as Textual Inversion, DreamBooth, and LoRA. Yet their real-world applicability is hindered by high storage demands, lengthy fine-tuning, and the need for multiple reference images. Existing ID embedding-based methods, by contrast, require only a single forward pass at inference, but face their own challenges: they either require extensive fine-tuning across numerous model parameters, lack compatibility with community pre-trained models, or fail to maintain high face fidelity. To address these limitations, we introduce InstantID, a powerful diffusion model-based solution. Our plug-and-play module handles image personalization in various styles using just a single facial image, while ensuring high fidelity. To achieve this, we design a novel IdentityNet that imposes strong semantic and weak spatial conditions, integrating facial and landmark images with textual prompts to steer the image generation. InstantID demonstrates exceptional performance and efficiency, proving highly beneficial in real-world applications where identity preservation is paramount. Moreover, our work integrates seamlessly with popular pre-trained text-to-image diffusion models such as SD1.5 and SDXL, serving as an adaptable plugin. Our code and pre-trained checkpoints will be available at https://github.com/InstantID/InstantID.