Text-driven image and video editing can be naturally cast as inpainting problems, where masked regions are reconstructed to remain consistent with both the observed content and the editing prompt. Recent advances in test-time guidance for diffusion and flow models provide a principled framework for this task; however, existing methods rely on costly vector-Jacobian product (VJP) computations to approximate the intractable guidance term, limiting their practical applicability. Building upon the recent work of Moufad et al. (2025), we provide theoretical insights into their VJP-free approximation and substantially extend their empirical evaluation to large-scale image and video editing benchmarks. Our results demonstrate that test-time guidance alone can achieve performance comparable to, and in some cases surpassing, that of training-based methods.
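To make the computational distinction concrete, the following toy sketch contrasts VJP-based guidance with a VJP-free surrogate for inpainting. This is an illustrative assumption, not the method of Moufad et al. (2025): it uses a hypothetical linear denoiser `A` so the VJP is available in closed form, and the VJP-free variant simply replaces the denoiser's Jacobian with the identity.

```python
import numpy as np

# Toy sketch (assumed setup, not the authors' method): guidance for inpainting
# with observation y = mask * x0. DPS-style guidance pulls the data-consistency
# residual back through the denoiser via a VJP (v -> J^T v); the VJP-free
# approximation replaces the Jacobian J with the identity.

rng = np.random.default_rng(0)
d = 8
A = rng.normal(size=(d, d)) / np.sqrt(d)      # hypothetical linear denoiser: x0_hat = A @ x_t
mask = (np.arange(d) < d // 2).astype(float)  # observe the first half of the pixels

x_t = rng.normal(size=d)        # current noisy iterate
y = mask * rng.normal(size=d)   # masked observation

x0_hat = A @ x_t                # denoiser's estimate of the clean signal
residual = mask * (y - x0_hat)  # data-consistency residual on observed pixels

# VJP-based guidance: backpropagate the residual through the denoiser.
# For this linear toy model the Jacobian is exactly A, so the VJP is A.T @ v.
grad_vjp = A.T @ residual

# VJP-free approximation: treat the Jacobian as the identity,
# avoiding any backward pass through the denoiser.
grad_free = residual

print(grad_vjp.shape, grad_free.shape)
```

In a real diffusion or flow model the VJP requires a full backward pass through the network at every guidance step, which is the cost the VJP-free approximation removes.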