Recent diffusion-based image editing methods commonly rely on text or high-level instructions to guide the generation process, offering intuitive but coarse control. In contrast, we focus on explicit, prompt-free editing, where the user directly specifies the modification by cropping an object or sub-object and pasting it at a chosen location within an image. This operation affords precise spatial and visual control, yet it introduces a fundamental challenge: preserving the identity of the pasted object while harmonizing it with its new context. We observe that attention maps in diffusion-based editing models inherently govern whether image regions are preserved or adapted for coherence. Building on this insight, we introduce LooseRoPE, a saliency-guided modulation of rotary position embedding (RoPE) that loosens the positional constraints to continuously control the attention field of view. By relaxing RoPE in this manner, our method smoothly steers the model's focus between faithful preservation of the input image and coherent harmonization of the inserted object, enabling a balanced trade-off between identity retention and contextual blending. Our approach provides a flexible and intuitive framework for image editing, achieving seamless compositional results without textual descriptions or complex user input.
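To make the idea of loosening positional constraints concrete, the following is a minimal sketch of standard 1D RoPE with one extra scalar that shrinks the rotation angles. It is an illustration under stated assumptions, not the LooseRoPE formulation: the abstract does not specify how the modulation is computed, so the names `rope_rotate`, `attention_logit`, and the `scale` parameter are hypothetical, and a single global factor stands in for the saliency-guided, per-region modulation.

```python
import math


def rope_rotate(vec, pos, scale=1.0, base=10000.0):
    """Apply 1D rotary position embedding to `vec` at integer position `pos`.

    `scale` is a hypothetical looseness knob (an assumption, not from the paper):
    1.0 gives standard RoPE, 0.0 removes all positional rotation.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even embedding dimension")
    out = []
    for i in range(0, d, 2):
        # i runs over even indices, so this is base^(-2j/d) for pair index j.
        theta = base ** (-i / d)
        angle = scale * pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        # Rotate the (x, y) pair by `angle`.
        out.extend([x * c - y * s, x * s + y * c])
    return out


def attention_logit(q, k, q_pos, k_pos, scale=1.0):
    """Unnormalized attention score between a rotated query and a rotated key."""
    qr = rope_rotate(q, q_pos, scale)
    kr = rope_rotate(k, k_pos, scale)
    return sum(a * b for a, b in zip(qr, kr))


def plain_dot(q, k):
    """Position-free dot product, for comparison."""
    return sum(a * b for a, b in zip(q, k))
```

With `scale=1.0` the score depends only on the relative offset between the two positions, as in ordinary RoPE. With `scale=0.0` the score reduces to the position-free dot product, so attention no longer distinguishes near from far tokens. Intermediate values interpolate between the two regimes, which is one simple way to picture a continuously adjustable attention field of view.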