We introduce Emu Video Edit (EVE), a model that establishes a new state of the art in video editing without relying on any supervised video editing data. To develop EVE, we separately train an image editing adapter and a video generation adapter, and attach both to the same text-to-image model. Then, to align the adapters towards video editing, we introduce a new unsupervised distillation procedure, Factorized Diffusion Distillation. This procedure distills knowledge from one or more teachers simultaneously, without any supervised data. We use this procedure to teach EVE to edit videos by jointly distilling knowledge from (i) the image editing adapter, to precisely edit each individual frame, and (ii) the video generation adapter, to ensure temporal consistency among the edited frames. Finally, to demonstrate the potential of our approach in unlocking other capabilities, we align additional combinations of adapters.
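To make the idea of distilling from two teachers at once more concrete, the following toy sketch may help. It is not the training code for EVE and does not reproduce the actual losses of Factorized Diffusion Distillation, which the abstract does not specify. It assumes a one-number-per-frame "video", a hypothetical frozen per-frame teacher whose edits jitter from frame to frame, and a hypothetical temporal teacher that penalizes frame-to-frame changes in the edit. The function name `distill`, the parameters `lam`, `lr`, and `steps`, and the squared-error objectives are all illustrative assumptions.

```python
# Toy illustration only: two stand-in "teachers" supervise one student.
# Nothing here is taken from the EVE implementation.

def distill(frames, jitter, lam=1.0, lr=0.05, steps=500):
    """Fit per-frame edit offsets against two frozen teachers at once.

    frames: input "video", one float per frame (assumed toy representation).
    jitter: per-frame inconsistency of the hypothetical frame-wise teacher.
    lam:    weight of the hypothetical temporal-consistency teacher.
    """
    n = len(frames)
    shift = [0.0] * n  # student parameters: the edit applied to each frame

    # Frame-wise teacher: wants each frame brightened by 1.0, but because it
    # edits frames independently its targets jitter from frame to frame.
    targets = [1.0 + j for j in jitter]

    for _ in range(steps):
        # Gradient of the frame-wise term: sum_t (shift[t] - targets[t])^2
        grad = [2.0 * (shift[t] - targets[t]) for t in range(n)]

        # Gradient of the temporal term: lam * sum_t (shift[t+1] - shift[t])^2
        for t in range(n - 1):
            diff = shift[t + 1] - shift[t]
            grad[t + 1] += 2.0 * lam * diff
            grad[t] -= 2.0 * lam * diff

        # One plain gradient-descent step on the sum of both terms.
        shift = [s - lr * g for s, g in zip(shift, grad)]

    edited = [x + s for x, s in zip(frames, shift)]
    return edited, shift


if __name__ == "__main__":
    edited, shift = distill([0.0, 0.5, 1.0, 1.5], [0.3, -0.3, 0.3, -0.3])
    print("per-frame edit offsets:", shift)
    print("edited frames:", edited)
```

In this sketch the frame-wise term alone would reproduce the jittery per-frame targets, while the temporal term pulls neighbouring offsets together. Summing the two gives a student whose edit still averages the intended amount but varies less across frames, which is the intuition behind combining an editing teacher with a temporal-consistency teacher.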