Controlled video generation has improved drastically in recent years. However, editing actions and dynamic events, or inserting content that should affect the behavior of other objects in real-world videos, remains a major challenge. Existing trained models struggle with complex edits, likely due to the difficulty of collecting relevant training data. Similarly, existing training-free methods are inherently restricted to structure- and motion-preserving edits and do not support modifying motion or interactions. Here, we introduce DynaEdit, a training-free editing method that unlocks versatile video editing capabilities with pretrained text-to-video flow models. Our method relies on the recently introduced inversion-free approach, which does not intervene in the model internals and is thus model-agnostic. We show that naively adapting this approach to general unconstrained editing results in severe low-frequency misalignment and high-frequency jitter. We explain the sources of these phenomena and introduce novel mechanisms for overcoming them. Through extensive experiments, we show that DynaEdit achieves state-of-the-art results on complex text-based video editing tasks, including modifying actions, inserting objects that interact with the scene, and introducing global effects.