Text-to-image diffusion models have shown great potential for image editing, with text-based and object-dragging techniques emerging as key approaches. However, each has inherent limitations: text-based methods struggle with precise object positioning, while object-dragging methods are confined to static relocation. To address these issues, we propose InstructUDrag, a diffusion-based framework that combines text instructions with object dragging, enabling simultaneous object relocation and text-based image editing. Our framework treats object dragging as an image reconstruction process, divided into two synergistic branches. The moving-reconstruction branch uses energy-based gradient guidance to move objects accurately, refining cross-attention maps to improve relocation precision. The text-driven editing branch shares gradient signals with the reconstruction branch, ensuring consistent transformations and allowing fine-grained control over object attributes. We also employ DDPM inversion and inject prior information into the noise maps to preserve the structure of moved objects. Extensive experiments demonstrate that InstructUDrag enables flexible, high-fidelity image editing, offering both precise object relocation and semantic control over image content.
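To make the idea of energy-based gradient guidance concrete, below is a minimal PyTorch sketch. It is an illustrative assumption, not the paper's implementation: the names `energy_from_attention` and `guidance_step` are hypothetical, and a toy softmax map stands in for the UNet's cross-attention. The sketch only shows the general mechanism the abstract describes: define an energy that is low when the object's attention mass sits inside the drag target region, then differentiate it with respect to the latent to steer denoising.

```python
# Minimal sketch of energy-based gradient guidance on a cross-attention map.
# All names and the toy attention model are hypothetical illustrations.
import torch

def energy_from_attention(attn: torch.Tensor, target_mask: torch.Tensor) -> torch.Tensor:
    """Energy is low when the object's attention mass lies inside the
    user-specified target region (the drag destination)."""
    inside = (attn * target_mask).sum()
    total = attn.sum() + 1e-8
    return 1.0 - inside / total  # in [0, 1]; 0 means fully relocated

def guidance_step(latent, attn_fn, target_mask, step_size=0.1):
    """One guidance update: differentiate the energy w.r.t. the latent
    and move the latent downhill, nudging the denoising trajectory."""
    latent = latent.detach().requires_grad_(True)
    attn = attn_fn(latent)  # toy stand-in for the UNet's cross-attention
    energy = energy_from_attention(attn, target_mask)
    grad, = torch.autograd.grad(energy, latent)
    return (latent - step_size * grad).detach(), energy.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    latent = torch.randn(1, 4, 16, 16)
    # Toy "attention": softmax over spatial positions of a linear map of the latent.
    weight = torch.randn(4)
    def attn_fn(z):
        logits = torch.einsum("bchw,c->bhw", z, weight)
        return torch.softmax(logits.flatten(1), dim=1).view_as(logits)
    mask = torch.zeros(1, 16, 16)
    mask[:, :, 8:] = 1.0  # drag target: right half of the frame
    for _ in range(50):
        latent, e = guidance_step(latent, attn_fn, mask)
    print(f"final energy: {e:.3f}")  # decreases as attention shifts into the mask
```

In the full framework, the same gradient signal would be shared with the text-driven editing branch, which is what keeps the relocation and the attribute edits consistent.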