Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model

Text-to-image generative models have attracted growing attention for flexible image editing guided by user-specified descriptions. However, text descriptions alone cannot convey the fine details of a subject, so such methods often compromise the subject's identity or require additional per-subject fine-tuning. We introduce a new framework, \textit{Paste, Inpaint and Harmonize via Denoising} (PhD), which uses an exemplar image in addition to text descriptions to specify user intent. In the pasting step, an off-the-shelf segmentation model identifies a user-specified subject within the exemplar image; the subject is then inserted into a background image, giving an initialization that captures both scene context and subject identity. To ensure the visual coherence of the generated or edited image, we introduce an inpainting and harmonizing module that guides the pre-trained diffusion model to blend the inserted subject seamlessly into the scene. Because the pre-trained diffusion model is kept frozen, its strong image synthesis and text-driven abilities are preserved, yielding high-quality results and flexible editing with diverse text prompts. In our experiments, we apply PhD to subject-driven image editing and explore text-driven scene generation given a reference subject. Both quantitative and qualitative comparisons with baseline methods demonstrate that our approach achieves state-of-the-art performance on both tasks. More qualitative results can be found at \url{https://sites.google.com/view/phd-demo-page}.
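
The pasting step described above can be illustrated as a masked composite of the exemplar subject onto the background. This is a minimal sketch under stated assumptions, not the authors' implementation: the subject mask is assumed to be already available (the abstract obtains it from an off-the-shelf segmentation model), and the function name `paste_subject`, its parameters, and the returned region mask are hypothetical names introduced only for illustration. The inpainting and harmonizing stages that follow are not shown.

```python
import numpy as np


def paste_subject(background, exemplar, mask, top, left):
    """Composite the masked exemplar pixels onto a copy of the background.

    background: (H, W, 3) float array, the scene image
    exemplar:   (h, w, 3) float array, crop containing the subject
    mask:       (h, w) array in [0, 1], 1 where the subject is (assumed given)
    top, left:  where the crop's top-left corner lands in the background
    Returns the pasted image and a full-size mask of the pasted region.
    """
    h, w = mask.shape
    H, W = background.shape[:2]
    if top < 0 or left < 0 or top + h > H or left + w > W:
        raise ValueError("subject crop must fit inside the background")

    out = background.astype(np.float32).copy()
    alpha = mask.astype(np.float32)[..., None]  # (h, w, 1), broadcasts over RGB
    region = out[top:top + h, left:left + w]
    # Subject pixels where alpha is 1, background pixels where alpha is 0.
    out[top:top + h, left:left + w] = alpha * exemplar + (1.0 - alpha) * region

    # Full-size mask marking where the subject was pasted; a later
    # inpainting/harmonizing stage could use it to locate the edit.
    full_mask = np.zeros((H, W), dtype=np.float32)
    full_mask[top:top + h, left:left + w] = mask
    return out, full_mask


# Toy usage: a white 2x2 subject pasted into a black 8x8 background.
background = np.zeros((8, 8, 3), dtype=np.float32)
exemplar = np.ones((4, 4, 3), dtype=np.float32)
mask = np.zeros((4, 4), dtype=np.float32)
mask[1:3, 1:3] = 1.0
pasted, region_mask = paste_subject(background, exemplar, mask, top=2, left=2)
```

The pasted image on its own typically shows seams and mismatched lighting at the subject boundary, which is why the abstract treats it only as an initialization for the frozen diffusion model.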
