The rapid advancement of pretrained text-driven diffusion models has significantly enriched applications in image generation and editing. However, as the demand for personalized content editing grows, new challenges emerge, especially when dealing with arbitrary objects and complex scenes. Existing methods usually mistake the mask for an object shape prior, which makes it difficult to achieve seamless integration. The widely used inversion-noise initialization also hinders identity consistency with the target object. To address these challenges, we propose a novel training-free framework that formulates personalized content editing as the optimization of edited images in the latent space, using diffusion models as energy-function guidance conditioned on reference text-image pairs. A coarse-to-fine strategy employs text energy guidance at the early stage to achieve a natural transition toward the target class, and then uses point-to-point feature-level image energy guidance to perform fine-grained appearance alignment with the target object. Additionally, we introduce latent-space content composition to enhance overall identity consistency with the target. Extensive experiments demonstrate that our method excels at object replacement even across a large domain gap, highlighting its potential for high-quality, personalized image editing.
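To make the coarse-to-fine energy-guided optimization concrete, the sketch below illustrates the overall loop under stated assumptions: `text_energy` and `image_feature_energy` are hypothetical toy stand-ins (the actual framework would derive them from a pretrained diffusion model's denoising score and intermediate UNet features, respectively), and the switch point, step count, and learning rate are illustrative placeholders, not the paper's settings.

```python
# Minimal sketch of coarse-to-fine latent-space optimization with energy
# guidance. All energy functions here are hypothetical stand-ins; the real
# method would compute them from a pretrained text-driven diffusion model.
import torch


def text_energy(latent: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # Hypothetical text-conditioned energy: in the real framework this would
    # be a denoising (score-matching) loss from the diffusion model
    # conditioned on the target-class prompt. Here: a toy quadratic term.
    return ((latent.mean(dim=(-2, -1)) - text_emb) ** 2).sum()


def image_feature_energy(latent: torch.Tensor, ref_feat: torch.Tensor) -> torch.Tensor:
    # Hypothetical point-to-point feature energy: the real method would align
    # diffusion UNet features of the edited latent with those of the reference
    # object. Here: a plain L2 distance between latents as a toy stand-in.
    return ((latent - ref_feat) ** 2).mean()


def coarse_to_fine_edit(latent_init, text_emb, ref_feat,
                        n_steps=200, switch=0.5, lr=0.05):
    """Optimize the edited latent with energy guidance: text energy early
    (coarse transition toward the target class), then feature-level image
    energy (fine-grained appearance alignment with the target object)."""
    latent = latent_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([latent], lr=lr)
    for step in range(n_steps):
        opt.zero_grad()
        if step < int(switch * n_steps):
            energy = text_energy(latent, text_emb)            # coarse stage
        else:
            energy = image_feature_energy(latent, ref_feat)   # fine stage
        energy.backward()
        opt.step()
    return latent.detach()


if __name__ == "__main__":
    # Toy usage with random tensors standing in for real latents/embeddings.
    torch.manual_seed(0)
    z0 = torch.randn(1, 4, 64, 64)   # source-image latent
    txt = torch.randn(1, 4)          # target-class text embedding (toy)
    ref = torch.randn(1, 4, 64, 64)  # reference-object features (toy)
    z_edit = coarse_to_fine_edit(z0, txt, ref)
    print(z_edit.shape)
```

The single `switch` hyperparameter (an assumption here) controls when guidance moves from the coarse text stage to the fine image stage; in the actual method this schedule would be coupled to the diffusion process rather than a fixed step fraction.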