Zero-shot subject-driven image generation aims to produce images that incorporate a subject from a given example image. The challenge lies in preserving the subject's identity while aligning with the text prompt, which often requires modifying certain aspects of the subject's appearance. Despite advancements in diffusion-based methods, existing approaches still struggle to balance identity preservation with text prompt alignment. In this study, we conducted an in-depth investigation into this issue and uncovered key insights for achieving effective identity preservation without sacrificing text alignment. Our key findings include: (1) the design of the subject image encoder significantly impacts identity preservation quality, and (2) separating text and subject guidance is crucial for both text alignment and identity preservation. Building on these insights, we introduce a new approach called EZIGen, which employs two main strategies: (1) a carefully crafted subject image Encoder, built on the pretrained UNet of the Stable Diffusion model, to ensure high-quality identity transfer, and (2) a generation process that decouples the text and subject guidance stages and iteratively refines the initial image layout. Through these strategies, EZIGen achieves state-of-the-art results on multiple subject-driven benchmarks with a unified model and 100 times less training data. The demo page is available at: https://zichengduan.github.io/pages/EZIGen/index.html.
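To make the decoupled-guidance idea concrete, below is a minimal, hypothetical PyTorch sketch of how text-guided layout denoising might be separated from subject-appearance injection, with the two phases alternating for iterative refinement. All names (`SubjectEncoder`, `LayoutDenoiser`, `inject_subject`), shapes, step counts, and the feature-blending step are illustrative assumptions, not the paper's actual implementation, which operates inside the Stable Diffusion UNet's attention layers.

```python
import torch
import torch.nn as nn


class SubjectEncoder(nn.Module):
    """Stand-in for the frozen, pretrained-UNet-based subject image encoder."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Conv2d(3, dim, kernel_size=3, padding=1)

    def forward(self, subject_img: torch.Tensor) -> torch.Tensor:
        # Extract appearance features from the subject example image.
        return self.net(subject_img)


class LayoutDenoiser(nn.Module):
    """Stand-in denoiser: predicts noise from latents under text conditioning."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, latents: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # A real denoiser would cross-attend to text_emb; this toy one ignores it.
        return self.net(latents)


def inject_subject(latents: torch.Tensor, feats: torch.Tensor, w: float = 0.3) -> torch.Tensor:
    # Hypothetical appearance-transfer step: blend pooled subject features into
    # the layout latents (the actual method injects them via UNet attention).
    pooled = feats.mean(dim=1, keepdim=True).expand_as(latents)
    return (1 - w) * latents + w * pooled


if __name__ == "__main__":
    denoiser, encoder = LayoutDenoiser(), SubjectEncoder()
    latents = torch.randn(1, 3, 32, 32)         # noisy image latents
    subject = torch.randn(1, 3, 32, 32)         # subject example image
    text_emb = torch.randn(1, 77, 768)          # placeholder text embedding

    subj_feats = encoder(subject)
    for _ in range(3):                          # iterative layout refinement rounds
        for _ in range(10):                     # stage 1: text-only layout denoising
            latents = latents - 0.1 * denoiser(latents, text_emb)
        latents = inject_subject(latents, subj_feats)  # stage 2: identity transfer
    print(latents.shape)                        # torch.Size([1, 3, 32, 32])
```

The point of the sketch is the control flow rather than the modules themselves: text guidance alone shapes the layout first, subject features are merged in afterward, and alternating the two phases lets the layout be refined without the subject signal dominating text alignment.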