Recent diffusion-based image editing methods commonly rely on text or high-level instructions to guide the generation process, offering intuitive but coarse control. In contrast, we focus on explicit, prompt-free editing, where the user directly specifies the modification by cropping and pasting an object or sub-object into a chosen location within an image. This operation affords precise spatial and visual control, yet it introduces a fundamental challenge: preserving the identity of the pasted object while harmonizing it with its new context. We observe that attention maps in diffusion-based editing models inherently govern whether image regions are preserved or adapted for coherence. Building on this insight, we introduce LooseRoPE, a saliency-guided modulation of rotary positional encoding (RoPE) that loosens positional constraints to continuously control the attention field of view. By relaxing RoPE in this manner, our method smoothly steers the model's focus between faithful preservation of the input image and coherent harmonization of the inserted object, enabling a balanced trade-off between identity retention and contextual blending. Our approach provides a flexible and intuitive framework for image editing, achieving seamless compositional results without textual descriptions or complex user input.
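To make the mechanism concrete, the sketch below illustrates one plausible reading of saliency-guided RoPE loosening: each token's rotation angle is scaled by a per-token saliency score in [0, 1], so high-saliency tokens keep the full positional encoding while low-saliency tokens see their positional differences flattened, widening their effective attention field of view. The function name `loose_rope`, the split-half RoPE variant, and the linear angle-scaling rule are all illustrative assumptions; the abstract does not specify the exact LooseRoPE formulation.

```python
# Minimal sketch of saliency-modulated RoPE (hypothetical formulation,
# not the paper's verified implementation).
import torch

def loose_rope(x: torch.Tensor, saliency: torch.Tensor,
               base: float = 10000.0) -> torch.Tensor:
    """Apply RoPE with per-token loosening.

    x:        (seq_len, dim) query/key features, dim must be even.
    saliency: (seq_len,) scores in [0, 1]; 1 = full positional encoding
              (preserve region), 0 = fully loosened (position-agnostic,
              free to attend broadly for harmonization).
    """
    seq_len, dim = x.shape
    half = dim // 2
    # Standard RoPE inverse frequencies.
    inv_freq = base ** (-torch.arange(0, half, dtype=x.dtype) / half)
    pos = torch.arange(seq_len, dtype=x.dtype)
    # Loosening assumption: shrink each token's effective position by its
    # saliency factor, which flattens rotation differences between tokens
    # and thus relaxes the positional constraint on attention.
    angles = (saliency * pos)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    # Split-half rotary rotation (NeoX-style pairing).
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Usage sketch: rotate query/key features before attention.
x = torch.randn(16, 64)   # 16 tokens, 64-dim features
s = torch.rand(16)        # per-token saliency in [0, 1]
q = loose_rope(x, s)      # loosened features, shape (16, 64)
```

Under this reading, saliency near 1 recovers standard RoPE, so attention stays positionally strict and the region is faithfully preserved; saliency near 0 removes positional discrimination, letting the inserted region attend broadly to its new context for harmonization, which matches the preservation-versus-blending trade-off the abstract describes.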