We present Emu Video, a text-to-video generation model that factorizes the generation into two steps: first generating an image conditioned on the text, and then generating a video conditioned on the text and the generated image. We identify critical design decisions (adjusted noise schedules for diffusion, and multi-stage training) that enable us to directly generate high-quality, high-resolution videos without requiring a deep cascade of models as in prior work. In human evaluations, our generated videos are strongly preferred in quality over all prior work: 81% vs. Google's Imagen Video, 90% vs. Nvidia's PYOCO, and 96% vs. Meta's Make-A-Video. Our model also outperforms commercial solutions such as RunwayML's Gen2 and Pika Labs. Finally, our factorized approach naturally lends itself to animating images based on a user's text prompt, where our generations are preferred 96% of the time over prior work.
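The two-step factorization described above can be illustrated with a minimal sketch. This is not the Emu Video implementation: the function names `generate_video` and `animate_image`, the callables `text_to_image` and `image_text_to_video`, the `num_frames` parameter, and the toy stand-in models are all hypothetical, chosen only to show how the second stage consumes both the prompt and the first-stage image.

```python
from typing import Callable, List

# Placeholder types for illustration only (assumption, not from the paper).
Image = List[List[float]]   # a tiny H x W grid standing in for an image
Video = List[Image]         # a video is a list of frames

TextToImage = Callable[[str], Image]
ImageTextToVideo = Callable[[str, Image, int], Video]


def generate_video(prompt: str,
                   text_to_image: TextToImage,
                   image_text_to_video: ImageTextToVideo,
                   num_frames: int = 4) -> Video:
    """Factorized text-to-video: text -> image, then (text, image) -> video."""
    image = text_to_image(prompt)                         # step 1: condition on text
    return image_text_to_video(prompt, image, num_frames)  # step 2: condition on text + image


def animate_image(prompt: str,
                  image: Image,
                  image_text_to_video: ImageTextToVideo,
                  num_frames: int = 4) -> Video:
    """Image animation reuses only step 2, with a user-supplied image."""
    return image_text_to_video(prompt, image, num_frames)


# Toy stand-ins so the sketch runs; real models would be diffusion networks.
def toy_text_to_image(prompt: str) -> Image:
    value = float(len(prompt))
    return [[value, value], [value, value]]


def toy_image_text_to_video(prompt: str, image: Image, num_frames: int) -> Video:
    # Each frame offsets the conditioning image by its frame index.
    return [[[pixel + t for pixel in row] for row in image]
            for t in range(num_frames)]


if __name__ == "__main__":
    video = generate_video("a dog", toy_text_to_image, toy_image_text_to_video)
    print(len(video), "frames of size",
          len(video[0]), "x", len(video[0][0]))
```

Under these assumptions, the image-animation use case mentioned in the abstract corresponds to skipping the first step and passing a user-provided image directly to the second stage, as `animate_image` does.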