Diffusion-based editing models have emerged as a powerful tool for semantic image and video manipulation. However, existing models lack a mechanism for smoothly controlling the intensity of text-guided edits. In standard text-conditioned generation, Classifier-Free Guidance (CFG) governs prompt adherence, suggesting it as a potential control for edit intensity in editing models. However, we show that scaling CFG in these models does not produce a smooth transition between the input and the edited result. We attribute this behavior to the unconditional prediction, which serves as the guidance origin and dominates generation at low guidance scales, yet represents an arbitrary manipulation of the input content. To enable continuous control, we introduce Adaptive-Origin Guidance (AdaOr), which replaces the standard guidance origin with an identity-conditioned adaptive origin, obtained by conditioning the model on an identity instruction corresponding to the identity manipulation. By interpolating this identity prediction with the standard unconditional prediction according to the edit strength, we ensure a continuous transition from the input to the edited result. We evaluate our method on image and video editing tasks, demonstrating that it provides smoother and more consistent control than current slider-based editing approaches. Our method incorporates an identity instruction into the standard training framework, enabling fine-grained control at inference time without any per-edit procedure or reliance on specialized datasets.
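To make the guidance mechanism concrete, the following is a minimal sketch of how an adaptive origin could be combined with standard classifier-free guidance, assuming a PyTorch-style noise predictor. The function name, argument names, and the linear interpolation schedule are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def adaor_guidance(eps_cond: torch.Tensor,
                   eps_uncond: torch.Tensor,
                   eps_identity: torch.Tensor,
                   strength: float,
                   cfg_scale: float) -> torch.Tensor:
    """Sketch of Adaptive-Origin Guidance (AdaOr); details are assumptions.

    eps_cond:     prediction conditioned on the edit instruction
    eps_uncond:   standard unconditional prediction
    eps_identity: prediction conditioned on an identity ("keep unchanged") instruction
    strength:     edit strength in [0, 1]; 0 preserves the input, 1 applies the full edit
    cfg_scale:    usual classifier-free guidance scale
    """
    # Adaptive origin: interpolate between the identity prediction (strength = 0)
    # and the standard unconditional prediction (strength = 1). The linear
    # schedule here is an illustrative assumption.
    origin = (1.0 - strength) * eps_identity + strength * eps_uncond
    # Apply guidance from the adaptive origin instead of the unconditional one.
    return origin + cfg_scale * (eps_cond - origin)
```

At strength 0 the guidance origin coincides with the identity prediction, so the output stays close to the input; increasing the strength moves the origin toward the standard unconditional prediction, recovering ordinary CFG behavior at full strength.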