Text-to-image diffusion models have proven effective for solving many image editing tasks. However, the seemingly straightforward task of seamlessly relocating objects within a scene remains surprisingly challenging. Existing methods addressing this problem often struggle to function reliably in real-world scenarios because they lack spatial reasoning. In this work, we propose a training-free method, dubbed DiffUHaul, that harnesses the spatial understanding of a localized text-to-image model for the object dragging task. Blindly manipulating the layout inputs of the localized model tends to yield poor editing results because object representations are intrinsically entangled in the model. To address this, we first apply attention masking in each denoising step to make the generation more disentangled across different objects, and we adopt a self-attention sharing mechanism to preserve high-level object appearance. Furthermore, we propose a new diffusion anchoring technique: in the early denoising steps, we interpolate the attention features between the source and target images to smoothly fuse the new layout with the original appearance; in the later denoising steps, we pass the localized features from the source images to the interpolated images to retain fine-grained object details. To adapt DiffUHaul to real-image editing, we apply DDPM self-attention bucketing, which better reconstructs real images with the localized model. Finally, we introduce an automated evaluation pipeline for this task and demonstrate the efficacy of our method. Our results are further supported by a user preference study.
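To make the two-phase anchoring schedule described above more concrete, the snippet below gives a minimal sketch of how the attention-feature blending could be organized. It is not the paper's implementation: the function name, tensor shapes, the interpolation weight `alpha`, the `object_mask`, and the `switch_frac` schedule are all illustrative assumptions.

```python
import torch

def anchored_attention_features(src_feats, tgt_feats, step, num_steps,
                                alpha, object_mask, switch_frac=0.5):
    """Illustrative blend of source/target self-attention features.

    src_feats, tgt_feats: (batch, heads, tokens, dim) attention outputs from
        the source and target denoising passes at the current step
        (hypothetical shapes).
    alpha: interpolation weight toward the target layout, in [0, 1].
    object_mask: boolean mask over tokens covered by the dragged object.
    switch_frac: assumed fraction of steps after which interpolation stops
        and localized source features are copied instead.
    """
    if step < switch_frac * num_steps:
        # Early denoising steps: linearly interpolate attention features so
        # the new layout fuses smoothly with the original appearance.
        return (1 - alpha) * src_feats + alpha * tgt_feats
    # Later denoising steps: inject source features only inside the object
    # region to retain fine-grained details; keep target features elsewhere.
    out = tgt_feats.clone()
    out[..., object_mask, :] = src_feats[..., object_mask, :]
    return out
```

In this sketch, the early interpolation trades off layout fidelity against appearance preservation via `alpha`, while the later masked copy restricts detail transfer to the relocated object so the rest of the scene follows the target pass.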