Text-to-image diffusion models have proven effective for solving many image editing tasks. However, the seemingly straightforward task of seamlessly relocating objects within a scene remains surprisingly challenging. Existing methods addressing this problem often struggle to function reliably in real-world scenarios because they lack spatial reasoning. In this work, we propose a training-free method, dubbed DiffUHaul, that harnesses the spatial understanding of a localized text-to-image model for the object dragging task. Blindly manipulating the layout inputs of the localized model tends to yield poor editing results because object representations are intrinsically entangled in the model. To address this, we first apply attention masking in each denoising step to make the generation more disentangled across different objects, and we adopt a self-attention sharing mechanism to preserve high-level object appearance. Furthermore, we propose a new diffusion anchoring technique: in the early denoising steps, we interpolate the attention features between the source and target images to smoothly fuse the new layout with the original appearance; in the later denoising steps, we pass the localized features from the source images to the interpolated images to retain fine-grained object details. To adapt DiffUHaul to real-image editing, we apply DDPM self-attention bucketing, which better reconstructs real images with the localized model. Finally, we introduce an automated evaluation pipeline for this task and demonstrate the efficacy of our method. Our results are further supported by a user preference study.
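To make the two-phase anchoring schedule described above more concrete, the snippet below gives a minimal sketch of how the attention-feature blending could be organized. It is not the paper's implementation: the function name, tensor shapes, the interpolation weight `alpha`, the `object_mask`, and the `switch_frac` schedule are all illustrative assumptions.

```python
import torch

def anchored_attention_features(src_feats, tgt_feats, step, num_steps,
                                alpha, object_mask, switch_frac=0.5):
    """Illustrative blend of source/target self-attention features.

    src_feats, tgt_feats: (batch, heads, tokens, dim) attention outputs from
        the source and target denoising passes at the current step
        (hypothetical shapes).
    alpha: interpolation weight toward the target layout, in [0, 1].
    object_mask: boolean mask over tokens covered by the dragged object.
    switch_frac: assumed fraction of steps after which interpolation stops
        and localized source features are copied instead.
    """
    if step < switch_frac * num_steps:
        # Early denoising steps: linearly interpolate attention features so
        # the new layout fuses smoothly with the original appearance.
        return (1 - alpha) * src_feats + alpha * tgt_feats
    # Later denoising steps: inject source features only inside the object
    # region to retain fine-grained details; keep target features elsewhere.
    out = tgt_feats.clone()
    out[..., object_mask, :] = src_feats[..., object_mask, :]
    return out
```

In this sketch, the early interpolation trades off layout fidelity against appearance preservation via `alpha`, while the later masked copy restricts detail transfer to the relocated object so the rest of the scene follows the target pass.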