In this report, we present MagicEdit, a surprisingly simple yet effective solution to the text-guided video editing task. We found that high-fidelity and temporally coherent video-to-video translation can be achieved by explicitly disentangling the learning of content, structure, and motion signals during training. This is in contrast to most existing methods, which attempt to jointly model appearance and temporal representations within a single framework, an approach that, we argue, leads to degradation in per-frame quality. Despite its simplicity, we show that MagicEdit supports various downstream video editing tasks, including video stylization, local editing, video-MagicMix, and video outpainting.