LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence

Reliance on implicit point matching via attention has become a core bottleneck in drag-based editing, forcing a fundamental compromise that involves weakened inversion strength and costly test-time optimization (TTO). This compromise severely limits the generative capabilities of diffusion models, suppressing high-fidelity inpainting and text-guided creation. In this paper, we introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers, which directly eliminates the reliance on implicit point matching. Concretely, our method generates an explicit correspondence map from user drag inputs and uses it as a reliable reference for attention control. This reference makes a stable full-strength inversion process possible, a first for drag-based editing. It removes the need for TTO and unlocks the generative capability of the underlying model. As a result, LazyDrag naturally unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach: opening the mouth of a dog and inpainting its interior, generating new objects such as a "tennis ball", or, for ambiguous drags, making context-aware changes such as moving a hand into a pocket. LazyDrag also supports multi-round workflows with simultaneous move and scale operations. Evaluated on DragBench, our method outperforms baselines in drag accuracy and perceptual quality, as validated by VIEScore and human evaluation. LazyDrag not only establishes new state-of-the-art performance but also opens a path toward new editing paradigms.
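To make the idea of an explicit correspondence map concrete, the following is a minimal sketch and not the LazyDrag implementation: the abstract does not specify how the map is built. It assumes a simple winner-takes-all rule in which every destination pixel inherits the displacement of the nearest drag target and looks up its source pixel by subtracting that displacement. The function name `correspondence_map` and the (row, col) point convention are hypothetical choices for illustration.

```python
import numpy as np


def correspondence_map(height, width, handles, targets):
    """Return an (H, W, 2) integer array of source (row, col) coordinates.

    Illustrative assumption: each destination pixel follows the drag whose
    target point is closest to it (winner-takes-all). This is a stand-in
    for the paper's map, whose construction is not given in the abstract.
    """
    handles = np.asarray(handles, dtype=float)  # (K, 2) drag start points
    targets = np.asarray(targets, dtype=float)  # (K, 2) drag end points

    # Coordinate grid of every destination pixel, shape (H, W, 2).
    rows, cols = np.mgrid[0:height, 0:width]
    grid = np.stack([rows, cols], axis=-1).astype(float)

    # Distance from each pixel to each drag target, shape (H, W, K).
    dist = np.linalg.norm(grid[:, :, None, :] - targets[None, None], axis=-1)
    nearest = dist.argmin(axis=-1)  # index of the closest drag per pixel

    # Subtract the chosen drag's displacement to find the source location.
    displacement = (targets - handles)[nearest]  # (H, W, 2)
    source = grid - displacement

    # Keep source coordinates inside the image.
    source[..., 0] = np.clip(source[..., 0], 0, height - 1)
    source[..., 1] = np.clip(source[..., 1], 0, width - 1)
    return np.rint(source).astype(int)


# Two drags on an 8x8 grid: one moves right by 3, the other left by 3.
cmap = correspondence_map(8, 8, [(2, 2), (6, 6)], [(2, 5), (6, 3)])
```

Under this assumption, the pixel at each drag target maps back exactly to its handle point, which is the kind of fixed, inspectable reference that implicit attention-based matching does not provide. A real system would likely blend displacements and restrict them to an editable region.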
