Editing on the Generative Manifold: A Theoretical and Empirical Study of General Diffusion-Based Image Editing Trade-offs

Diffusion-based editing has rapidly evolved from curated inpainting tools into general-purpose editors spanning text-guided instruction following, mask-localized edits, drag-based geometric manipulation, exemplar transfer, and training-free composition systems. Despite strong empirical progress, the field lacks a unified treatment of core desiderata that govern practical usability: controllability (how precisely and continuously the user can specify an edit), faithfulness to user intent (semantic alignment to instructions), semantic consistency (preservation of identity and non-target content), locality (containment of changes), and perceptual quality (artifact suppression and detail retention). This paper provides a theoretical and empirical analysis of general diffusion-based image editing, connecting diverse paradigms through a common view of editing as guided transport on a learned image manifold. We first formalize editing as an operator induced by a conditional reverse-time generative process and define task-agnostic metrics capturing instruction adherence, region preservation, semantic consistency, and stability under repeated edits. We then develop theory describing edit dynamics under (i) noise-injection and denoising transport, (ii) inversion-and-edit pipelines and the propagation of inversion errors, and (iii) locality constraints implemented via masked guidance or hard constraints. Under mild Lipschitz assumptions on the learned score or flow field, we derive bounds connecting guidance strength and inversion error to measurable deviations in non-target regions, and we characterize accumulation effects under iterative multi-turn editing. Empirically, we benchmark representative paradigms.
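To make the locality and region-preservation notions above concrete, the following is a minimal sketch, not the paper's actual metric or constraint. It assumes a single-channel image stored as a NumPy array and a binary mask in which 1 marks the target (editable) region; the function names `blend_with_mask` and `non_target_deviation` are hypothetical and chosen only for illustration. The hard constraint is modeled as a simple pixel-wise blend, and the deviation is a mean squared error restricted to the non-target region.

```python
import numpy as np

def blend_with_mask(edited, source, mask):
    """Illustrative hard locality constraint: keep source pixels wherever mask == 0."""
    return mask * edited + (1.0 - mask) * source

def non_target_deviation(output, source, mask):
    """Mean squared deviation over the non-target region only (mask == 0)."""
    keep = 1.0 - mask
    n = keep.sum()
    if n == 0:
        return 0.0  # no non-target pixels to measure
    return float((keep * (output - source) ** 2).sum() / n)

# Toy data: random arrays stand in for a source image and an editor's raw output.
rng = np.random.default_rng(0)
source = rng.random((8, 8))
edited = rng.random((8, 8))
mask = np.zeros((8, 8))
mask[2:5, 2:5] = 1.0  # hypothetical target region

constrained = blend_with_mask(edited, source, mask)
print(non_target_deviation(edited, source, mask))       # unconstrained edit leaks outside the mask
print(non_target_deviation(constrained, source, mask))  # hard constraint leaves non-target pixels unchanged
```

In this toy setting the hard constraint drives the non-target deviation to zero by construction. Soft (guidance-based) locality would instead yield a small nonzero value, which is the kind of quantity the bounds described above relate to guidance strength and inversion error.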
